Computation and Language 84
☆ Arctic-TILT. Business Document Understanding at Sub-Billion Scale
Łukasz Borchmann, Michał Pietruszka, Wojciech Jaśkowski, Dawid Jurkiewicz, Piotr Halama, Paweł Józiak, Łukasz Garncarek, Paweł Liskowski, Karolina Szyndler, Andrzej Gretkowski, Julita Ołtusek, Gabriela Nowakowska, Artur Zawłocki, Łukasz Duhr, Paweł Dyda, Michał Turski
A vast portion of workloads employing LLMs involves answering questions
grounded in PDF or scanned content. We introduce Arctic-TILT, which achieves
accuracy on par with models 1000$\times$ its size on these use cases. It can be
fine-tuned and deployed on a single 24GB GPU, lowering operational costs while
processing Visually Rich Documents with up to 400k tokens. The model
establishes state-of-the-art results on seven diverse Document Understanding
benchmarks, as well as provides reliable confidence scores and quick inference,
which are essential for processing files in large-scale or time-sensitive
enterprise environments.
☆ LogogramNLP: Comparing Visual and Textual Representations of Ancient Logographic Writing Systems for NLP
Standard natural language processing (NLP) pipelines operate on symbolic
representations of language, which typically consist of sequences of discrete
tokens. However, creating an analogous representation for ancient logographic
writing systems is an extremely labor-intensive process that requires expert
knowledge. At present, a large portion of logographic data persists in a purely
visual form due to the absence of transcription -- this issue poses a
bottleneck for researchers seeking to apply NLP toolkits to study ancient
logographic languages: most of the relevant data are images of writing.
This paper investigates whether direct processing of visual representations
of language offers a potential solution. We introduce LogogramNLP, the first
benchmark enabling NLP analysis of ancient logographic languages, featuring
both transcribed and visual datasets for four writing systems along with
annotations for tasks like classification, translation, and parsing. Our
experiments compare systems that employ recent visual and text encoding
strategies as backbones. The results demonstrate that visual representations
outperform textual representations for some investigated tasks, suggesting that
visual processing pipelines may unlock a large amount of cultural heritage data
of logographic languages for NLP-based analyses.
☆ Transformer Explainer: Interactive Learning of Text-Generative Models IEEE VIS 2024
Aeree Cho, Grace C. Kim, Alexander Karpekov, Alec Helbling, Zijie J. Wang, Seongmin Lee, Benjamin Hoover, Duen Horng Chau
Transformers have revolutionized machine learning, yet their inner workings
remain opaque to many. We present Transformer Explainer, an interactive
visualization tool designed for non-experts to learn about Transformers through
the GPT-2 model. Our tool helps users understand complex Transformer concepts
by integrating a model overview and enabling smooth transitions across
abstraction levels of mathematical operations and model structures. It runs a
live GPT-2 instance locally in the user's browser, empowering users to
experiment with their own input and observe in real-time how the internal
components and parameters of the Transformer work together to predict the next
tokens. Our tool requires no installation or special hardware, broadening the
public's access to education on modern generative AI techniques. Our open-source
tool is available at https://poloclub.github.io/transformer-explainer/. A video
demo is available at https://youtu.be/ECR4oAwocjs.
comment: To be presented at IEEE VIS 2024
☆ Better Alignment with Instruction Back-and-Forth Translation
We propose a new method, instruction back-and-forth translation, to construct
high-quality synthetic data grounded in world knowledge for aligning large
language models (LLMs). Given documents from a web corpus, we generate and
curate synthetic instructions using the backtranslation approach proposed by Li
et al. (2023a), and rewrite the responses to further improve their quality based
on the initial documents. Fine-tuning with the resulting (backtranslated
instruction, rewritten response) pairs yields higher win rates on AlpacaEval
than using other common instruction datasets such as Humpback, ShareGPT, Open
Orca, Alpaca-GPT4 and Self-instruct. We also demonstrate that rewriting the
responses with an LLM outperforms direct distillation, and that the two
generated text distributions are markedly distinct in embedding space. Further
analysis shows that our backtranslated instructions are of higher quality than
other sources of synthetic instructions, while our responses are more diverse
and complex than those obtained from distillation. Overall we find that
instruction back-and-forth translation combines the best of both worlds --
making use of the information diversity and quantity found on the web, while
ensuring the quality of the responses which is necessary for effective
alignment.
☆ Code-switching in text and speech reveals information-theoretic audience design
In this work, we use language modeling to investigate the factors that
influence code-switching. Code-switching occurs when a speaker alternates
between one language variety (the primary language) and another (the secondary
language), and is widely observed in multilingual contexts. Recent work has
shown that code-switching is often correlated with areas of high information
load in the primary language, but it is unclear whether high primary language
load only makes the secondary language relatively easier to produce at
code-switching points (speaker-driven code-switching), or whether
code-switching is additionally used by speakers to signal the need for greater
attention on the part of listeners (audience-driven code-switching). In this
paper, we use bilingual Chinese-English online forum posts and transcripts of
spontaneous Chinese-English speech to replicate prior findings that high
primary language (Chinese) information load is correlated with switches to the
secondary language (English). We then demonstrate that the information load of
the English productions is even higher than that of meaning-equivalent Chinese
alternatives, and these are therefore not easier to produce, providing evidence
of audience-driven influences in code-switching at the level of the
communication channel, not just at the sociolinguistic level, in both writing
and speech.
comment: Submitted to Journal of Memory and Language on 7 June 2024
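The information load measure in work like this is usually per-token surprisal
under a language model. A minimal sketch of that computation, assuming the
Hugging Face transformers library, with GPT-2 as a stand-in scorer rather than
the models or Chinese-English data used in the paper:

    import math
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    def surprisals(text):
        # Surprisal of token t is -log2 P(t | preceding tokens).
        ids = tok(text, return_tensors="pt").input_ids
        with torch.no_grad():
            logprobs = torch.log_softmax(lm(ids).logits, dim=-1)
        pairs = []
        for i in range(1, ids.size(1)):
            nats = -logprobs[0, i - 1, ids[0, i]].item()
            pairs.append((tok.decode(ids[0, i]), nats / math.log(2)))
        return pairs  # (token, surprisal in bits)

    for token, bits in surprisals("He switched to another language"):
        print(f"{token!r}: {bits:.2f} bits")

High-surprisal spans in the primary language are the candidate code-switching
points the abstract refers to.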
☆ Towards Resilient and Efficient LLMs: A Comparative Study of Efficiency, Performance, and Adversarial Robustness
With the increasing demand for practical applications of Large Language
Models (LLMs), many attention-efficient models have been developed to balance
performance and computational cost. However, the adversarial robustness of
these models remains under-explored. In this work, we design a framework to
investigate the trade-off between efficiency, performance, and adversarial
robustness of LLMs by comparing three prominent models with varying levels of
complexity and efficiency -- Transformer++, Gated Linear Attention (GLA)
Transformer, and MatMul-Free LM -- utilizing the GLUE and AdvGLUE datasets. The
AdvGLUE dataset extends the GLUE dataset with adversarial samples designed to
challenge model robustness. Our results show that while the GLA Transformer and
MatMul-Free LM achieve slightly lower accuracy on GLUE tasks, they demonstrate
higher efficiency and either superior or comparable robustness on AdvGLUE
tasks compared to Transformer++ across different attack levels. These findings
highlight the potential of simplified architectures to achieve a compelling
balance between efficiency, performance, and adversarial robustness, offering
valuable insights for applications where resource constraints and resilience to
adversarial attacks are critical.
☆ SCENE: Evaluating Explainable AI Techniques Using Soft Counterfactuals
Explainable Artificial Intelligence (XAI) is essential for enhancing the
transparency and accountability of AI models, especially in natural language
processing (NLP) tasks. This paper introduces SCENE (Soft Counterfactual
Evaluation for Natural language Explainability), a novel evaluation method that
leverages large language models (LLMs) to generate Soft Counterfactual
explanations in a zero-shot manner. By focusing on token-based substitutions,
SCENE creates contextually appropriate and semantically meaningful Soft
Counterfactuals without extensive fine-tuning. SCENE adopts Validity$_{soft}$
and C$_{soft}$ metrics to evaluate the effectiveness of model-agnostic XAI
methods in
text classification tasks. Applied to CNN, RNN, and BERT architectures, SCENE
provides valuable insights into the strengths and limitations of various XAI
techniques.
comment: 10 pages, 5 tables
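The core operation, contextually appropriate token substitution, can be
approximated with a masked language model. A generic sketch, assuming the
Hugging Face transformers fill-mask pipeline; it illustrates the idea of a
soft counterfactual, not the SCENE implementation itself:

    from transformers import pipeline

    fill = pipeline("fill-mask", model="bert-base-uncased")

    def soft_substitutes(tokens, position, top_k=5):
        # Mask one token and let the model propose plausible replacements.
        masked = tokens.copy()
        masked[position] = fill.tokenizer.mask_token
        candidates = fill(" ".join(masked), top_k=top_k)
        # Drop the original token so each candidate is a true substitution.
        return [c["token_str"] for c in candidates
                if c["token_str"] != tokens[position]]

    print(soft_substitutes("the movie was great fun".split(), 3))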
☆ Learning Fine-Grained Grounded Citations for Attributed Large Language Models ACL 2024
Lei Huang, Xiaocheng Feng, Weitao Ma, Yuxuan Gu, Weihong Zhong, Xiachong Feng, Weijiang Yu, Weihua Peng, Duyu Tang, Dandan Tu, Bing Qin
Despite the impressive performance on information-seeking tasks, large
language models (LLMs) still struggle with hallucinations. Attributed LLMs,
which augment generated text with in-line citations, have shown potential in
mitigating hallucinations and improving verifiability. However, current
approaches suffer from suboptimal citation quality due to their reliance on
in-context learning. Furthermore, the practice of citing only coarse document
identifiers makes it challenging for users to perform fine-grained
verification. In this work, we introduce FRONT, a training framework designed
to teach LLMs to generate Fine-Grained Grounded Citations. By grounding model
outputs in fine-grained supporting quotes, these quotes guide the generation of
grounded and consistent responses, not only improving citation quality but also
facilitating fine-grained verification. Experiments on the ALCE benchmark
demonstrate the efficacy of FRONT in generating superior grounded responses and
highly supportive citations. With LLaMA-2-7B, the framework significantly
outperforms all the baselines, achieving an average of 14.21% improvement in
citation quality across all datasets, even surpassing ChatGPT.
comment: Accepted by ACL 2024 Findings
☆ Conversational Prompt Engineering
Liat Ein-Dor, Orith Toledo-Ronen, Artem Spector, Shai Gretz, Lena Dankin, Alon Halfon, Yoav Katz, Noam Slonim
Prompts are how humans communicate with LLMs. Informative prompts are
essential for guiding LLMs to produce the desired output. However, prompt
engineering is often tedious and time-consuming, requiring significant
expertise, which limits its widespread use. We propose Conversational Prompt
Engineering (CPE), a user-friendly tool that helps users create personalized
prompts for their specific tasks. CPE uses a chat model to briefly interact
with users, helping them articulate their output preferences and integrating
these into the prompt. The process includes two main stages: first, the model
uses user-provided unlabeled data to generate data-driven questions and utilizes
user responses to shape the initial instruction. Then, the model shares the
outputs generated by the instruction and uses user feedback to further refine
the instruction and the outputs. The final result is a few-shot prompt, where
the outputs approved by the user serve as few-shot examples. A user study on
summarization tasks demonstrates the value of CPE in creating personalized,
high-performing prompts. The results suggest that the zero-shot prompt obtained
is comparable to its -- much longer -- few-shot counterpart, indicating
significant savings in scenarios involving repetitive tasks with large text
volumes.
☆ Bias-Aware Low-Rank Adaptation: Mitigating Catastrophic Inheritance of Large Language Models
Large language models (LLMs) have exhibited remarkable proficiency across a
diverse array of natural language processing (NLP) tasks. However, adapting
LLMs to downstream applications typically necessitates computationally
intensive and memory-demanding fine-tuning procedures. To mitigate these
burdens, parameter-efficient fine-tuning (PEFT) techniques have emerged as a
promising approach to tailor LLMs with minimal computational overhead. While
PEFT methods offer substantial advantages, they do not fully address the
pervasive issue of bias propagation from pre-training data. In this work, we
introduce Bias-Aware Low-Rank Adaptation (BA-LoRA), a novel PEFT method
designed to counteract bias inheritance. BA-LoRA incorporates three distinct
regularization terms: (1) consistency regularizer, (2) diversity regularizer,
and (3) singular vector decomposition regularizer. These regularizers
collectively aim to improve the generative models' consistency, diversity, and
generalization capabilities during the fine-tuning process. Through extensive
experiments on a variety of natural language understanding (NLU) and natural
language generation (NLG) tasks, employing prominent LLMs such as LLaMA,
Mistral, and Gemma, we demonstrate that BA-LoRA surpasses the performance of
LoRA and its state-of-the-art variants. Moreover, our method effectively
mitigates the deleterious effects of pre-training bias, leading to more
reliable and robust model outputs. The code is available at
https://github.com/cyp-jlu-ai/BA-LoRA.
comment: Work in progress
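The abstract names three regularizers but not their formulas, so any concrete
loss can only be illustrative. A hedged PyTorch sketch of how a task loss
might be combined with consistency, diversity, and SVD-based terms; the
specific definitions below are placeholders, not the paper's:

    import torch

    def ba_lora_style_loss(task_loss, logits_a, logits_b, lora_update,
                           lam=(0.1, 0.1, 0.1)):
        # (1) consistency: two stochastic forward passes should agree
        consistency = torch.nn.functional.kl_div(
            logits_a.log_softmax(-1), logits_b.softmax(-1),
            reduction="batchmean")
        # (2) diversity: penalize low-entropy (collapsed) batch outputs
        mean_probs = logits_a.softmax(-1).mean(0)
        diversity = (mean_probs * mean_probs.clamp_min(1e-9).log()).sum()
        # (3) spectral term on the LoRA update, via its singular values
        s = torch.linalg.svdvals(lora_update)
        spectral = ((s / s.sum()) ** 2).sum()
        return (task_loss + lam[0] * consistency
                + lam[1] * diversity + lam[2] * spectral)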
☆ Molyé: A Corpus-based Approach to Language Contact in Colonial France
Whether or not several Creole languages which developed during the early
modern period can be considered genetic descendants of European languages has
been the subject of intense debate. This is in large part due to the absence of
evidence of intermediate forms. This work introduces a new open corpus, the
Molyé corpus, which combines stereotypical representations of three kinds of
language variation in Europe with early attestations of French-based Creole
languages across a period of 400 years. It is intended to facilitate future
research on the continuity between contact situations in Europe and Creolophone
(former) colonies.
comment: 8 main pages and 3 pages of references
☆ MemeMind at ArAIEval Shared Task: Spotting Persuasive Spans in Arabic Text with Persuasion Techniques Identification
This paper focuses on detecting propagandistic spans and persuasion
techniques in Arabic text from tweets and news paragraphs. Each entry in the
dataset contains a text sample and corresponding labels that indicate the start
and end positions of propaganda techniques within the text. Tokens falling
within a labeled span were assigned "B" (Begin) or "I" (Inside) tags
corresponding to the specific propaganda technique, while tokens outside any
span were assigned "O" (Outside). Using attention masks, we padded each span to
a uniform length and assigned BIO tags to each token based on the provided
labels. Then, we used the AraBERT-base pre-trained model for Arabic
text tokenization and embeddings with a token classification layer to identify
propaganda techniques. Our training process involves a two-phase fine-tuning
approach. First, we train only the classification layer for a few epochs,
followed by full model fine-tuning, updating all parameters. This methodology
allows the model to adapt to the specific characteristics of the propaganda
detection task while leveraging the knowledge captured by the pre-trained
AraBERT model. Our approach achieved an F1 score of 0.2774, securing the 3rd
position in the leaderboard of Task 1.
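BIO tag assignment from span annotations, as described above, reduces to a
small amount of bookkeeping over token offsets. A minimal sketch, not the
authors' code:

    def bio_tags(tokens, spans):
        # tokens: list of (text, start, end) character offsets
        # spans: list of (start, end, technique) annotations
        tags = ["O"] * len(tokens)
        for s_start, s_end, technique in spans:
            begun = False
            for i, (_, t_start, t_end) in enumerate(tokens):
                if t_start >= s_start and t_end <= s_end:
                    tags[i] = ("I-" if begun else "B-") + technique
                    begun = True
        return tags

    tokens = [("Buy", 0, 3), ("now", 4, 7), ("!", 7, 8)]
    print(bio_tags(tokens, [(0, 7, "Slogan")]))
    # -> ['B-Slogan', 'I-Slogan', 'O']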
☆ Compromesso! Italian Many-Shot Jailbreaks Undermine the Safety of Large Language Models ACL 2024
As diverse linguistic communities and users adopt large language models
(LLMs), assessing their safety across languages becomes critical. Despite
ongoing efforts to make LLMs safe, they can still be made to behave unsafely
with jailbreaking, a technique in which models are prompted to act outside
their operational guidelines. Research on LLM safety and jailbreaking, however,
has so far mostly focused on English, limiting our understanding of LLM safety
in other languages. We contribute towards closing this gap by investigating the
effectiveness of many-shot jailbreaking, where models are prompted with unsafe
demonstrations to induce unsafe behaviour, in Italian. To enable our analysis,
we create a new dataset of unsafe Italian question-answer pairs. With this
dataset, we identify clear safety vulnerabilities in four families of
open-weight LLMs. We find that the models exhibit unsafe behaviors even when
prompted with few unsafe demonstrations, and -- more alarmingly -- that this
tendency rapidly escalates with more demonstrations.
comment: Accepted at ACL 2024 (Student Research Workshop)
☆ Articulatory Configurations across Genders and Periods in French Radio and TV archives
This paper studies changes in articulatory configurations across genders and
periods using an inversion from acoustic to articulatory parameters. From a
diachronic corpus based on French media archives spanning 60 years from 1955 to
2015, automatic transcription and forced alignment allowed extracting the
central frame of each vowel. More than one million frames were obtained from
over a thousand speakers across gender and age categories. The formants of
these vocalic frames were used to fit the parameters of Maeda's articulatory
model. Evaluations of the quality of these processes are provided. We focus
here on two parameters of Maeda's model linked to total vocal tract length: the
relative position of the larynx (higher for females) and lip protrusion
(more protruded for males). Implications for voice quality across genders are
discussed. The effect across periods seems gender independent; thus, the
assertion that females lowered their pitch with time is not supported.
comment: accepted to Interspeech 2024, Kos Island, Greece. Keywords: acoustic
to articulatory inversion, diachrony, gender, French, media
☆ Can LLMs Beat Humans in Debating? A Dynamic Multi-agent Framework for Competitive Debate
Competitive debate is a comprehensive and complex computational argumentation
task. Large Language Models (LLMs) encounter hallucinations and lack
competitiveness in this task. To address these challenges, we introduce Agent
for Debate (Agent4Debate), a dynamic, multi-agent framework based on LLMs
designed to enhance their capabilities in competitive debate. Drawing
inspiration from human behavior in debate preparation and execution,
Agent4Debate employs a collaborative architecture where four specialized agents
(Searcher, Analyzer, Writer, and Reviewer) dynamically interact and cooperate.
These agents work throughout the debate process, covering multiple stages from
initial research and argument formulation to rebuttal and summary. To
comprehensively evaluate framework performance, we construct the Chinese Debate
Arena, comprising 66 carefully selected Chinese debate motions. We recruit ten
experienced human debaters and collect records of 200 debates involving
Agent4Debate, baseline models, and humans. The evaluation employs the Debatrix
automatic scoring system and professional human reviewers based on the
established Debatrix-Elo and Human-Elo rankings. Experimental results indicate
that the state-of-the-art Agent4Debate exhibits capabilities comparable to
those of humans. Furthermore, ablation studies demonstrate the effectiveness of
each component in the agent structure.
comment: 9 pages, 3 figures
☆ Crowd Intelligence for Early Misinformation Prediction on Social Media
Misinformation spreads rapidly on social media, causing serious damage by
influencing public opinion, promoting dangerous behavior, or eroding trust in
reliable sources. It spreads too fast for traditional fact-checking, stressing
the need for predictive methods. We introduce CROWDSHIELD, a crowd
intelligence-based method for early misinformation prediction. We hypothesize
that the crowd's reactions to misinformation reveal its accuracy. Furthermore,
we draw on exaggerated assertions/claims and on replies taking particular
positions/stances on the source post within a conversation thread. We employ
Q-learning to capture the two dimensions -- stances and claims. We utilize deep
Q-learning due to its proficiency in navigating complex decision spaces and
effectively learning network properties. Additionally, we use a
transformer-based encoder to develop a comprehensive understanding of both
content and context. This multifaceted approach helps ensure the model pays
attention to user interaction and stays anchored in the communication's
content. We propose MIST, a manually annotated misinformation detection Twitter
corpus comprising nearly 200 conversation threads with more than 14K replies.
In experiments, CROWDSHIELD outperformed ten baseline systems, achieving an
improvement of ~4% macro-F1 score. We conduct an ablation study and error
analysis to validate our proposed model's performance. The source code and
dataset are available at https://github.com/LCS2-IIITD/CrowdShield.git.
comment: This work has been submitted to the IEEE for possible publication
☆ AcrosticSleuth: Probabilistic Identification and Ranking of Acrostics in Multilingual Corpora
For centuries, writers have hidden messages in their texts as acrostics,
where initial letters of consecutive lines or paragraphs form meaningful words
or phrases. Scholars searching for acrostics manually can only focus on a few
authors at a time and often favor qualitative arguments in discussing
intentionality. We aim to put the study of acrostics on firmer statistical
footing by presenting AcrosticSleuth, a first-of-its-kind tool that
automatically identifies acrostics and ranks them by the probability that the
sequence of characters does not occur by chance (and therefore may have been
inserted intentionally). Acrostics are rare, so we formalize the problem as a
binary classification task in the presence of extreme class imbalance. To
evaluate AcrosticSleuth, we present the Acrostic Identification Dataset
(AcrostID), a collection of acrostics from the WikiSource online database.
Despite the class imbalance, AcrosticSleuth achieves F1 scores of 0.39, 0.59,
and 0.66 on French, English, and Russian subdomains of WikiSource,
respectively. We further demonstrate that AcrosticSleuth can identify
previously unknown high-profile instances of wordplay, such as the acrostic
spelling ARSPOETICA (``art of poetry'') by Italian Humanist Albertino Mussato
and English philosopher Thomas Hobbes' signature in the opening paragraphs of
The Elements of Law.
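The ranking statistic is the probability that a sequence of initial letters
occurs by chance. A toy sketch under a unigram model of word-initial letters;
AcrosticSleuth's actual probability model is more sophisticated than this:

    import math
    from collections import Counter

    def chance_log_prob(lines, corpus_text):
        # Initial letters of consecutive lines form the candidate acrostic.
        initials = [ln.strip()[0].lower() for ln in lines if ln.strip()]
        firsts = [w[0].lower() for w in corpus_text.split()
                  if w[0].isalpha()]
        freq = Counter(firsts)
        total = sum(freq.values())
        # Lower values -> less likely by chance -> ranked higher.
        return sum(math.log(freq.get(c, 1) / total) for c in initials)

    poem = ["Roses are red", "Under skies so blue", "Night will fall"]
    print(chance_log_prob(poem, "a reference corpus of running text"))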
☆ Recognizing Emotion Regulation Strategies from Human Behavior with Large Language Models
Philipp Müller, Alexander Heimerl, Sayed Muddashir Hossain, Lea Siegel, Jan Alexandersson, Patrick Gebhard, Elisabeth André, Tanja Schneeberger
Human emotions are often not expressed directly, but regulated according to
internal processes and social display rules. For affective computing systems,
an understanding of how users regulate their emotions can be highly useful, for
example to provide feedback in job interview training, or in psychotherapeutic
scenarios. However, at present no method to automatically classify different
emotion regulation strategies in a cross-user scenario exists. At the same
time, recent studies showed that instruction-tuned Large Language Models (LLMs)
can reach impressive performance across a variety of affect recognition tasks
such as categorical emotion recognition or sentiment analysis. While these
results are promising, it remains unclear to what extent the representational
power of LLMs can be utilized in the more subtle task of classifying users'
internal emotion regulation strategy. To close this gap, we make use of the
recently introduced \textsc{Deep} corpus for modeling the social display of the
emotion shame, where each point in time is annotated with one of seven
different emotion regulation classes. We fine-tune Llama2-7B as well as the
recently introduced Gemma model using Low-rank Optimization on prompts
generated from different sources of information on the \textsc{Deep} corpus.
These include verbal and nonverbal behavior, person factors, as well as the
results of an in-depth interview after the interaction. Our results show that
a fine-tuned Llama2-7B LLM is able to classify the utilized emotion regulation
strategy with high accuracy (0.84) without needing access to data from
post-interaction interviews. This represents a significant improvement over
previous approaches based on Bayesian Networks and highlights the importance of
modeling verbal behavior in emotion regulation.
comment: Accepted to ACII'24
☆ Enhancing Robustness of Retrieval-Augmented Language Models with In-Context Learning
Retrieval-Augmented Language Models (RALMs) have significantly improved
performance in open-domain question answering (QA) by leveraging external
knowledge. However, RALMs still struggle with unanswerable queries, where the
retrieved contexts do not contain the correct answer, and with conflicting
information, where different sources provide contradictory answers due to
imperfect retrieval. This study introduces an in-context learning-based
approach to enhance the reasoning capabilities of RALMs, making them more
robust in imperfect retrieval scenarios. Our method incorporates Machine
Reading Comprehension (MRC) demonstrations, referred to as cases, to boost the
model's capabilities to identify unanswerabilities and conflicts among the
retrieved contexts. Experiments on two open-domain QA datasets show that our
approach increases accuracy in identifying unanswerable and conflicting
scenarios without requiring additional fine-tuning. This work demonstrates that
in-context learning can effectively enhance the robustness of RALMs in
open-domain QA tasks.
comment: 10 pages, 2 figures
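Since the paper's exact template is not given in the abstract, the following
prompt-assembly sketch is only illustrative of how MRC demonstrations
("cases") might be prepended to the retrieved contexts and question:

    def build_prompt(cases, contexts, question):
        parts = []
        for c in cases:  # each case: dict with context/question/answer
            parts.append(f"Context: {c['context']}\n"
                         f"Question: {c['question']}\n"
                         f"Answer: {c['answer']}")
        joined = "\n".join(contexts)
        parts.append(f"Context: {joined}\nQuestion: {question}\nAnswer:")
        return "\n\n".join(parts)

    case = {"context": "The sky is blue.",
            "question": "Is the sky green?",
            "answer": "Unanswerable from the given context."}
    print(build_prompt([case], ["Paris is in France."], "Where is Paris?"))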
☆ Exploring Reasoning Biases in Large Language Models Through Syllogism: Insights from the NeuBAROCO Dataset ACL 2024
This paper explores the question of how accurately current large language
models can perform logical reasoning in natural language, with an emphasis on
whether these models exhibit reasoning biases similar to humans. Specifically,
our study focuses on syllogistic reasoning, a form of deductive reasoning
extensively studied in cognitive science as a natural form of human reasoning.
We present a syllogism dataset called NeuBAROCO, which consists of syllogistic
reasoning problems in English and Japanese. This dataset was originally
designed for psychological experiments to assess human reasoning capabilities
using various forms of syllogisms. Our experiments with leading large language
models indicate that these models exhibit reasoning biases similar to humans,
along with other error tendencies. Notably, there is significant room for
improvement in reasoning problems where the relationship between premises and
hypotheses is neither entailment nor contradiction. We also present
experimental results and in-depth analysis using a new Chain-of-Thought
prompting method, which asks LLMs to translate syllogisms into abstract logical
expressions and then explain their reasoning process. Our analysis using this
method suggests that the primary limitations of LLMs lie in the reasoning
process itself rather than the interpretation of syllogisms.
comment: To appear in Findings of the Association for Computational
Linguistics: ACL 2024
☆ Automated Educational Question Generation at Different Bloom's Skill Levels using Large Language Models: Strategies and Evaluation
Developing questions that are pedagogically sound, relevant, and promote
learning is a challenging and time-consuming task for educators. Modern-day
large language models (LLMs) generate high-quality content across multiple
domains, potentially helping educators to develop high-quality questions.
Automated educational question generation (AEQG) is important in scaling online
education catering to a diverse student population. Past attempts at AEQG have
shown limited abilities to generate questions at higher cognitive levels. In
this study, we examine the ability of five state-of-the-art LLMs of different
sizes to generate diverse and high-quality questions of different cognitive
levels, as defined by Bloom's taxonomy. We use advanced prompting techniques
with varying complexity for AEQG. We conducted expert and LLM-based evaluations
to assess the linguistic and pedagogical relevance and quality of the
questions. Our findings suggest that LLMs can generate relevant and
high-quality educational questions of different cognitive levels when prompted
with adequate information, although there is a significant variance in the
performance of the five LLMs considered. We also show that automated evaluation
is not on par with human evaluation.
☆ Open-domain Implicit Format Control for Large Language Model Generation
Yiqun Yao, Wenjia Ma, Xuezhi Fang, Xin Jiang, Xiang Li, Xuying Meng, Peng Han, Jing Li, Aixin Sun, Yequan Wang
Controlling the format of outputs generated by large language models (LLMs)
is a critical functionality in various applications. Current methods typically
employ constrained decoding with rule-based automata or fine-tuning with
manually crafted format instructions, both of which struggle with open-domain
format requirements. To address this limitation, we introduce a novel framework
for controlled generation in LLMs, leveraging user-provided, one-shot QA pairs.
This study investigates LLMs' capabilities to follow open-domain, one-shot
constraints and replicate the format of the example answers. We observe that
this is a non-trivial problem for current LLMs. We also develop a dataset
collection methodology for supervised fine-tuning that enhances the open-domain
format control of LLMs without degrading output quality, as well as a benchmark
on which we evaluate both the helpfulness and format correctness of LLM
outputs. The resulting datasets, named OIFC-SFT, along with the related code,
will be made publicly available at https://github.com/cofe-ai/OIFC.
comment: 6 pages
☆ Overview of the NLPCC 2024 Shared Task on Chinese Metaphor Generation
This paper presents the results of the shared task on Chinese metaphor
generation, hosted at the 13th CCF Conference on Natural Language Processing
and Chinese Computing (NLPCC 2024). The goal of this shared task is to generate
Chinese metaphors using machine learning techniques and to effectively identify
basic components of metaphorical sentences. It is divided into two subtasks: 1)
Metaphor Generation, which involves creating a metaphor from a provided tuple
consisting of TENOR, GROUND, and VEHICLE. The goal here is to synthesize a
metaphor that connects the subject (i.e. TENOR) with the object (i.e. VEHICLE),
guided by the concept of the GROUND. 2) Metaphor Components Identification,
which extracts the most fitting TENORs, GROUNDs, and VEHICLEs from a
metaphorical sentence. This component requires the identification of the most
fitting metaphor elements that correspond to the specified grounds. In addition
to overall results, we report on the setup and insights from the metaphor
generation shared task, which attracted a total of 4 participating teams across
both subtasks.
☆ Analyzing Consumer Reviews for Understanding Drivers of Hotels Ratings: An Indian Perspective
In the internet era, almost every business entity is trying to have its
digital footprint in digital media and other social media platforms. For these
entities, word of mouse is also very important. This is particularly
crucial for the hospitality sector, covering hotels, restaurants, etc.
Consumers do read other consumers' reviews before making final decisions. It
is therefore very important to understand which aspects weigh most in
consumers' minds when they give their ratings. The current
study focuses on the consumer reviews of Indian hotels to extract aspects
important for final ratings. The study involves gathering data using web
scraping methods, analyzing the texts using Latent Dirichlet Allocation for
topic extraction and sentiment analysis for aspect-specific sentiment mapping.
Finally, it incorporates Random Forest to understand the importance of the
aspects in predicting the final rating of a user.
comment: This is the pre-print of the paper that was accepted for oral
presentation and publication in the proceedings of IEEE ICCCNT 2024, which was
organized at IIT Mandi, India, from June 24 to 28, 2024. The paper is 5 pages
long and contains 4 figures and 6 tables. This is not the final version of
the paper.
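Two of the pipeline stages above, topic extraction and feature-importance
ranking, have standard scikit-learn implementations. A compact sketch with
toy reviews; the paper's preprocessing, sentiment mapping, and data are not
reproduced here:

    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.feature_extraction.text import CountVectorizer

    reviews = ["clean rooms and friendly staff",
               "food was cold and bland",
               "great location near the station",
               "rude staff and dirty rooms"]
    ratings = [5, 2, 4, 1]

    counts = CountVectorizer().fit_transform(reviews)
    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    topic_mix = lda.fit_transform(counts)  # per-review topic proportions

    rf = RandomForestRegressor(random_state=0).fit(topic_mix, ratings)
    print(rf.feature_importances_)  # which topics drive ratings most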
☆ Simulating Articulatory Trajectories with Phonological Feature Interpolation
As a first step towards a complete computational model of speech learning
involving perception-production loops, we investigate the forward mapping
between pseudo-motor commands and articulatory trajectories. Two phonological
feature sets, based respectively on generative and articulatory phonology, are
used to encode a phonetic target sequence. Different interpolation techniques
are compared to generate smooth trajectories in these feature spaces, with a
potential optimisation of the target value and timing to capture
co-articulation effects. We report the Pearson correlation between a linear
projection of the generated trajectories and articulatory data derived from a
multi-speaker dataset of electromagnetic articulography (EMA) recordings. A
correlation of 0.67 is obtained with an extended feature set based on
generative phonology and a linear interpolation technique. We discuss the
implications of our results for our understanding of the dynamics of biological
motion.
comment: accepted at Interspeech 2024
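The forward mapping experiment pairs interpolated feature trajectories with a
Pearson evaluation. A minimal numpy/scipy sketch with assumed feature targets
and synthetic data standing in for the EMA recordings:

    import numpy as np
    from scipy.stats import pearsonr

    # one feature dimension: three phonetic targets at given times (ms)
    target_times = np.array([0, 100, 250])
    target_values = np.array([0.2, 0.9, 0.4])

    t = np.arange(0, 251, 10)
    trajectory = np.interp(t, target_times, target_values)  # linear

    # synthetic stand-in for a linearly projected EMA trace
    ema_like = trajectory + np.random.default_rng(0).normal(0, 0.05, t.size)
    r, _ = pearsonr(trajectory, ema_like)
    print(f"Pearson r = {r:.2f}")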
☆ Enhancing Journalism with AI: A Study of Contextualized Image Captioning for News Articles using LLMs and LMMs
Large language models (LLMs) and large multimodal models (LMMs) have
significantly impacted the AI community, industry, and various economic
sectors. In journalism, integrating AI poses unique challenges and
opportunities, particularly in enhancing the quality and efficiency of news
reporting. This study explores how LLMs and LMMs can assist journalistic
practice by generating contextualised captions for images accompanying news
articles. We conducted experiments using the GoodNews dataset to evaluate the
ability of LMMs (BLIP-2, GPT-4v, or LLaVA) to incorporate one of two types of
context: entire news articles, or extracted named entities. In addition, we
compared their performance to a two-stage pipeline composed of a captioning
model (BLIP-2, OFA, or ViT-GPT2) with post-hoc contextualisation with LLMs
(GPT-4 or LLaMA). We assess a diversity of models, and we find that while the
choice of contextualisation model is a significant factor for the two-stage
pipelines, this is not the case in the LMMs, where smaller, open-source models
perform well compared to proprietary, GPT-powered ones. Additionally, we found
that controlling the amount of provided context enhances performance. These
results highlight the limitations of a fully automated approach and underscore
the necessity for an interactive, human-in-the-loop strategy.
☆ HydraFormer: One Encoder For All Subsampling Rates ICME 2024
In automatic speech recognition, subsampling is essential for tackling
diverse scenarios. However, the inadequacy of a single subsampling rate to
address various real-world situations often necessitates training and deploying
multiple models, consequently increasing associated costs. To address this
issue, we propose HydraFormer, comprising HydraSub, a Conformer-based encoder,
and a BiTransformer-based decoder. HydraSub encompasses multiple branches, each
representing a distinct subsampling rate, allowing for the flexible selection
of any branch during inference based on the specific use case. HydraFormer can
efficiently manage different subsampling rates, significantly reducing training
and deployment expenses. Experiments on AISHELL-1 and LibriSpeech datasets
reveal that HydraFormer effectively adapts to various subsampling rates and
languages while maintaining high recognition performance. Additionally,
HydraFormer showcases exceptional stability, sustaining consistent performance
under various initialization conditions, and exhibits robust transferability by
learning from pretrained single subsampling rate automatic speech recognition
models (model code and scripts: https://github.com/HydraFormer/hydraformer).
comment: accepted by ICME 2024
☆ Trans-Tokenization and Cross-lingual Vocabulary Transfers: Language Adaptation of LLMs for Low-Resource NLP
François Remy, Pieter Delobelle, Hayastan Avetisyan, Alfiya Khabibullina, Miryam de Lhoneux, Thomas Demeester
The development of monolingual language models for low and mid-resource
languages continues to be hindered by the difficulty in sourcing high-quality
training data. In this study, we present a novel cross-lingual vocabulary
transfer strategy, trans-tokenization, designed to tackle this challenge and
enable more efficient language adaptation. Our approach focuses on adapting a
high-resource monolingual LLM to an unseen target language by initializing the
token embeddings of the target language using a weighted average of
semantically similar token embeddings from the source language. For this, we
leverage a translation resource covering both the source and target languages.
We validate our method with the Tweeties, a series of trans-tokenized LLMs, and
demonstrate their competitive performance on various downstream tasks across a
small but diverse set of languages. Additionally, we introduce Hydra LLMs,
models with multiple swappable language modeling heads and embedding tables,
which further extend the capabilities of our trans-tokenization strategy. By
designing a Hydra LLM based on the multilingual model TowerInstruct, we
developed a state-of-the-art machine translation model for Tatar, in a
zero-shot manner, completely bypassing the need for high-quality parallel data.
This breakthrough is particularly significant for low-resource languages like
Tatar, where high-quality parallel data is hard to come by. By lowering the
data and time requirements for training high-quality models, our
trans-tokenization strategy allows for the development of LLMs for a wider
range of languages, especially those with limited resources. We hope that our
work will inspire further research and collaboration in the field of
cross-lingual vocabulary transfer and contribute to the empowerment of
languages on a global scale.
comment: Accepted at COLM 2024
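The initialization step the abstract describes, a weighted average of
semantically similar source embeddings, is easy to sketch. The vocabularies
and translation weights below are toy assumptions, not the paper's resources:

    import numpy as np

    src_emb = {"dog": np.array([1.0, 0.0]),
               "hound": np.array([0.8, 0.2])}
    # target token -> [(source token, translation weight)]
    alignment = {"hund": [("dog", 0.7), ("hound", 0.3)]}

    def init_target_embedding(token):
        # weighted average of aligned source-token embeddings
        pairs = alignment[token]
        total = sum(w for _, w in pairs)
        return sum(w * src_emb[s] for s, w in pairs) / total

    print(init_target_embedding("hund"))  # -> [0.94 0.06]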
☆ Are Social Sentiments Inherent in LLMs? An Empirical Study on Extraction of Inter-demographic Sentiments
Large language models (LLMs) are supposed to acquire unconscious human
knowledge and feelings, such as social common sense and biases, by training
models from large amounts of text. However, it is not clear how much the
sentiments of specific social groups can be captured in various LLMs. In this
study, we focus on social groups defined in terms of nationality, religion, and
race/ethnicity, and validate the extent to which sentiments between social
groups can be captured in and extracted from LLMs. Specifically, we input
questions regarding sentiments from one group to another into LLMs, apply
sentiment analysis to the responses, and compare the results with social
surveys. The validation results using five representative LLMs showed higher
correlations with relatively small p-values for nationalities and religions,
for which the number of data points was relatively large. This result indicates
that the LLM responses expressing inter-group sentiments align well with actual
social survey results.
☆ EMTeC: A Corpus of Eye Movements on Machine-Generated Texts
Lena Sophia Bolliger, Patrick Haller, Isabelle Caroline Rose Cretton, David Robert Reich, Tannon Kew, Lena Ann Jäger
The Eye Movements on Machine-Generated Texts Corpus (EMTeC) is a naturalistic
eye-movements-while-reading corpus of 107 native English speakers reading
machine-generated texts. The texts are generated by three large language models
using five different decoding strategies, and they fall into six different text
type categories. EMTeC comprises the eye movement data at all stages of
pre-processing, i.e., the raw coordinate data sampled at 2000 Hz, the fixation
sequences, and the reading measures. It further provides both the original and
a corrected version of the fixation sequences, accounting for vertical
calibration drift. Moreover, the corpus includes the language models' internals
that underlie the generation of the stimulus texts: the transition scores, the
attention scores, and the hidden states. The stimuli are annotated for a range
of linguistic features both at text and at word level. We anticipate EMTeC to
be utilized for a variety of use cases such as, but not restricted to, the
investigation of reading behavior on machine-generated text and the impact of
different decoding strategies; reading behavior on different text types; the
development of new pre-processing, data filtering, and drift correction
algorithms; the cognitive interpretability and enhancement of language models;
and the assessment of the predictive power of surprisal and entropy for human
reading times. The data at all stages of pre-processing, the model internals,
and the code to reproduce the stimulus generation, data pre-processing and
analyses can be accessed via https://github.com/DiLi-Lab/EMTeC/.
☆ LLM-DetectAIve: a Tool for Fine-Grained Machine-Generated Text Detection
Mervat Abassy, Kareem Elozeiri, Alexander Aziz, Minh Ngoc Ta, Raj Vardhan Tomar, Bimarsha Adhikari, Saad El Dine Ahmed, Yuxia Wang, Osama Mohammed Afzal, Zhuohan Xie, Jonibek Mansurov, Ekaterina Artemova, Vladislav Mikhailov, Rui Xing, Jiahui Geng, Hasan Iqbal, Zain Muhammad Mujahid, Tarek Mahmoud, Akim Tsvigun, Alham Fikri Aji, Artem Shelmanov, Nizar Habash, Iryna Gurevych, Preslav Nakov
The widespread accessibility of large language models (LLMs) to the general
public has significantly amplified the dissemination of machine-generated texts
(MGTs). Advancements in prompt manipulation have exacerbated the difficulty in
discerning the origin of a text (human-authored vs. machine-generated). This
raises concerns regarding the potential misuse of MGTs, particularly within
educational and academic domains. In this paper, we present
$\textbf{LLM-DetectAIve}$ -- a system designed for fine-grained MGT detection.
It is able to classify texts into four categories: human-written,
machine-generated, machine-written machine-humanized, and human-written
machine-polished. Contrary to previous MGT detectors that perform binary
classification, introducing two additional categories in LLM-DetectAIve offers
insights into the varying degrees of LLM intervention during text creation.
This might be useful in some domains like education, where any LLM intervention
is usually prohibited. Experiments show that LLM-DetectAIve can effectively
identify the authorship of textual content, proving its usefulness in enhancing
integrity in education, academia, and other domains. LLM-DetectAIve is publicly
accessible at https://huggingface.co/spaces/raj-tomar001/MGT-New. The video
describing our system is available at https://youtu.be/E8eT_bE7k8c.
☆ LaDiMo: Layer-wise Distillation Inspired MoEfier
The advent of large language models has revolutionized natural language
processing, but their increasing complexity has led to substantial training
costs, resource demands, and environmental impacts. In response, sparse
Mixture-of-Experts (MoE) models have emerged as a promising alternative to
dense models. Since training MoE models from scratch can be prohibitively
expensive, recent studies have explored leveraging knowledge from pre-trained
non-MoE models. However, existing approaches have limitations, such as
requiring significant hardware resources and data. We propose a novel
algorithm, LaDiMo, which efficiently converts a Transformer-based non-MoE model
into a MoE model with minimal additional training cost. LaDiMo consists of two
stages: layer-wise expert construction and routing policy decision. By
harnessing the concept of Knowledge Distillation, we compress the model and
rapidly recover its performance. Furthermore, we develop an adaptive router
that optimizes inference efficiency by profiling the distribution of routing
weights and determining a layer-wise policy that balances accuracy and latency.
We demonstrate the effectiveness of our method by converting the LLaMA2-7B
model to a MoE model using only 100K tokens, reducing activated parameters by
over 20% while preserving accuracy. Our approach offers a flexible and efficient
solution for building and deploying MoE models.
comment: 21 pages, 10 figures
☆ Analysis of Argument Structure Constructions in the Large Language Model BERT
This study investigates how BERT processes and represents Argument Structure
Constructions (ASCs), extending previous LSTM analyses. Using a dataset of 2000
sentences across four ASC types (transitive, ditransitive, caused-motion,
resultative), we analyzed BERT's token embeddings across 12 layers.
Visualizations with MDS and t-SNE and clustering quantified by Generalized
Discrimination Value (GDV) were used. Feedforward classifiers (probes)
predicted construction categories from embeddings. CLS token embeddings
clustered best in layers 2-4, decreased in intermediate layers, and slightly
increased in final layers. DET and SUBJ embeddings showed consistent clustering
in intermediate layers, VERB embeddings increased in clustering from layer 1 to
12, and OBJ embeddings peaked in layer 10. Probe accuracies indicated low
construction information in layer 1, with over 90 percent accuracy from layer 2
onward, revealing latent construction information beyond GDV clustering. Fisher
Discriminant Ratio (FDR) analysis of attention weights showed OBJ tokens were
crucial for differentiating ASCs, followed by VERB and DET tokens. SUBJ, CLS,
and SEP tokens had insignificant FDR scores. This study highlights BERT's
layered processing of linguistic constructions and its differences from LSTMs.
Future research will compare these findings with neuroimaging data to
understand the neural correlates of ASC processing. This research underscores
neural language models' potential to mirror linguistic processing in the human
brain, offering insights into the computational and neural mechanisms
underlying language understanding.
comment: arXiv admin note: text overlap with arXiv:2408.03062
☆ EfficientRAG: Efficient Retriever for Multi-Hop Question Answering
Ziyuan Zhuang, Zhiyang Zhang, Sitao Cheng, Fangkai Yang, Jia Liu, Shujian Huang, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang
Retrieval-augmented generation (RAG) methods encounter difficulties when
addressing complex questions like multi-hop queries. While iterative retrieval
methods improve performance by gathering additional information, current
approaches often rely on multiple calls of large language models (LLMs). In
this paper, we introduce EfficientRAG, an efficient retriever for multi-hop
question answering. EfficientRAG iteratively generates new queries without the
need for LLM calls at each iteration and filters out irrelevant information.
Experimental results demonstrate that EfficientRAG surpasses existing RAG
methods on three open-domain multi-hop question-answering datasets.
comment: 20 pages, 4 figures
☆ Explicating the Implicit: Argument Detection Beyond Sentence Boundaries ACL 2024
Detecting semantic arguments of a predicate word has been conventionally
modeled as a sentence-level task. The typical reader, however, perfectly
interprets predicate-argument relations in a much wider context than just the
sentence where the predicate was evoked. In this work, we reformulate the
problem of argument detection through textual entailment to capture semantic
relations across sentence boundaries. We propose a method that tests whether
some semantic relation can be inferred from a full passage by first encoding it
into a simple and standalone proposition and then testing for entailment
against the passage. Our method does not require direct supervision, which is
generally absent due to dataset scarcity, but instead builds on existing NLI
and sentence-level SRL resources. Such a method can potentially explicate
pragmatically understood relations into a set of explicit sentences. We
demonstrate it on a recent document-level benchmark, outperforming some
supervised methods and contemporary language models.
comment: 9 pages, ACL 2024
☆ Learning to Rewrite: Generalized LLM-Generated Text Detection
Large language models (LLMs) can be abused at scale to create non-factual
content and spread disinformation. Detecting LLM-generated content is essential
to mitigate these risks, but current classifiers often fail to generalize in
open-world contexts. Prior work shows that LLMs tend to rewrite LLM-generated
content less frequently, which can be used for detection and naturally
generalizes to unforeseen data. However, we find that the rewriting edit
distance between human and LLM content can be indistinguishable across domains,
leading to detection failures. We propose training an LLM to rewrite input
text, producing minimal edits for LLM-generated content and more edits for
human-written text, deriving a distinguishable and generalizable edit distance
difference across different domains. Experiments on text from 21 independent
domains and three popular LLMs (e.g., GPT-4o, Gemini, and Llama-3) show that
our classifier outperforms the state-of-the-art zero-shot classifier by up to
20.6% on AUROC score and the rewriting classifier by 9.2% on F1 score. Our work
suggests that LLMs can effectively detect machine-generated text if they are
trained properly.
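The detection signal is how much a rewriter edits its input. A minimal sketch
with difflib similarity as the edit-distance proxy and a stub rewriter
standing in for the trained LLM; the threshold is illustrative:

    import difflib

    def rewrite(text):  # stub standing in for the trained rewriter LLM
        return text.replace("utilize", "use")

    def edit_ratio(text):
        rewritten = rewrite(text)
        sim = difflib.SequenceMatcher(None, text, rewritten).ratio()
        return 1.0 - sim  # more edits -> more human-like input

    def classify(text, threshold=0.05):
        return ("human-written" if edit_ratio(text) > threshold
                else "LLM-generated")

    print(classify("We utilize a novel method to utilize the data."))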
☆ Evaluating Language Model Math Reasoning via Grounding in Educational Curricula
Our work presents a novel angle for evaluating language models' (LMs)
mathematical abilities, by investigating whether they can discern skills and
concepts enabled by math content. We contribute two datasets: one consisting of
385 fine-grained descriptions of K-12 math skills and concepts, or standards,
from Achieve the Core (ATC), and another of 9.9K problems labeled with these
standards (MathFish). Working with experienced teachers, we find that LMs
struggle to tag and verify standards linked to problems, and instead predict
labels that are close to ground truth, but differ in subtle ways. We also show
that LMs often generate problems that do not fully align with standards
described in prompts. Finally, we categorize problems in GSM8k using math
standards, allowing us to better understand why some problems are more
difficult for models to solve than others.
comment: 30 pages, 23 figures
☆ Diffusion Guided Language Modeling ACL
Current language models demonstrate remarkable proficiency in text
generation. However, for many applications it is desirable to control
attributes, such as sentiment, or toxicity, of the generated language --
ideally tailored towards each specific use case and target audience. For
auto-regressive language models, existing guidance methods are prone to
decoding errors that cascade during generation and degrade performance. In
contrast, text diffusion models can easily be guided with, for example, a
simple linear sentiment classifier -- however they do suffer from significantly
higher perplexity than auto-regressive alternatives. In this paper we use a
guided diffusion model to produce a latent proposal that steers an
auto-regressive language model to generate text with desired properties. Our
model inherits the unmatched fluency of the auto-regressive approach and the
plug-and-play flexibility of diffusion. We show that it outperforms previous
plug-and-play guidance methods across a wide range of benchmark data sets.
Further, controlling a new attribute in our framework is reduced to training a
single logistic regression classifier.
comment: ACL Findings 2024
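The last claim, that a new attribute requires only a logistic regression
classifier, is simple to picture. A sketch with scikit-learn, where random
vectors stand in for the diffusion model's latent proposals:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    latents = rng.normal(size=(200, 16))      # stand-in latent vectors
    labels = (latents[:, 0] > 0).astype(int)  # stand-in attribute labels

    clf = LogisticRegression().fit(latents, labels)
    # The fitted classifier can then steer latent proposals toward the
    # desired attribute during guided generation.
    print(clf.predict_proba(latents[:1]))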
☆ Simplifying Translations for Children: Iterative Simplification Considering Age of Acquisition with LLMs ACL 2024
In recent years, neural machine translation (NMT) has been widely used in
everyday life. However, the current NMT lacks a mechanism to adjust the
difficulty level of translations to match the user's language level.
Additionally, due to the bias in the training data for NMT, translations of
simple source sentences are often produced with complex words. In particular,
this could pose a problem for children, who may not be able to understand the
meaning of the translations correctly. In this study, we propose a method that
replaces words with high Age of Acquisitions (AoA) in translations with simpler
words to match the translations to the user's level. We achieve this by using
large language models (LLMs), providing a triple of a source sentence, a
translation, and a target word to be replaced. We create a benchmark dataset
using back-translation on Simple English Wikipedia. The experimental results
obtained from the dataset show that our method effectively replaces high-AoA
words with lower-AoA words and, moreover, can iteratively replace most of the
high-AoA words while still maintaining high BLEU and COMET scores.
comment: Findings of ACL 2024
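The iterative loop the abstract describes, replace the highest-AoA word above
a threshold and repeat, can be sketched with toy AoA values and a substitution
table standing in for the paper's AoA norms and LLM-generated replacements:

    AOA = {"purchase": 9.1, "buy": 4.2, "automobile": 8.7, "car": 3.9}
    SIMPLER = {"purchase": "buy", "automobile": "car"}

    def simplify(sentence, max_aoa=6.0):
        words = sentence.split()
        while True:
            # find the hardest word by (assumed) Age of Acquisition
            aoa, i = max((AOA.get(w.lower(), 0.0), i)
                         for i, w in enumerate(words))
            if aoa <= max_aoa or words[i].lower() not in SIMPLER:
                return " ".join(words)
            words[i] = SIMPLER[words[i].lower()]

    print(simplify("We purchase the automobile"))  # -> We buy the car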
☆ Attention Mechanism and Context Modeling System for Text Mining Machine Translation
This paper proposes a novel architecture based on the Transformer paradigm
that incorporates the K-means clustering algorithm to strengthen the model's
contextual comprehension. The Transformer model performs well in machine
translation tasks due to its parallel computing power and multi-head attention
mechanism. However, it may encounter contextual ambiguity or ignore local
features when dealing with highly complex language structures. To address this
limitation, we incorporate the K-means algorithm to cluster the words and
idioms of the input text, facilitating better identification and preservation
of the local structure and contextual information of the language. The
advantage of this combination is that K-means can automatically discover the
topic or concept regions in the text, which may be directly related to
translation quality. Consequently, the proposed architecture employs K-means
as a preprocessing stage before the Transformer and recalibrates the
multi-head attention weights to help distinguish words and idioms with similar
semantics or functions. This ensures that the model pays greater attention to
the contextual information captured by these clusters during training, rather
than focusing merely on positional information.
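The preprocessing stage described above amounts to clustering lexical vectors
before translation. A sketch with scikit-learn and toy two-dimensional
embeddings standing in for real lexical representations:

    import numpy as np
    from sklearn.cluster import KMeans

    vocab = ["bank", "finance", "money", "river", "shore", "water"]
    emb = np.array([[0.9, 0.1], [1.0, 0.0], [0.8, 0.2],
                    [0.1, 0.9], [0.0, 1.0], [0.2, 0.8]])

    clusters = KMeans(n_clusters=2, n_init=10,
                      random_state=0).fit_predict(emb)
    for word, c in zip(vocab, clusters):
        print(word, "-> cluster", c)
    # Cluster ids could then be used to bias multi-head attention
    # toward tokens in the same semantic cluster.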
☆ MMREC: LLM Based Multi-Modal Recommender System
The importance of recommender systems is growing rapidly due to the
exponential increase in the volume of content generated daily. This surge in
content presents unique challenges for designing effective recommender systems.
Key among these challenges is the need to effectively leverage the vast amounts
of natural language data and images that represent user preferences. This paper
presents a novel approach to enhancing recommender systems by leveraging Large
Language Models (LLMs) and deep learning techniques. The proposed framework
aims to improve the accuracy and relevance of recommendations by incorporating
multi-modal information processing and by the use of unified latent space
representation. The study explores the potential of LLMs to better understand
and utilize natural language data in recommendation contexts, addressing the
limitations of previous methods. The framework efficiently extracts and
integrates text and image information through LLMs, unifying diverse modalities
in a latent space to simplify the learning process for the ranking model.
Experimental results demonstrate the enhanced discriminative power of the model
when utilizing multi-modal information. This research contributes to the
evolving field of recommender systems by showcasing the potential of LLMs and
multi-modal data integration to create more personalized and contextually
relevant recommendations.
☆ wav2graph: A Framework for Supervised Learning Knowledge Graph from Speech
Knowledge graphs (KGs) enhance the performance of large language models
(LLMs) and search engines by providing structured, interconnected data that
improves reasoning and context-awareness. However, KGs only focus on text data,
thereby neglecting other modalities such as speech. In this work, we introduce
wav2graph, the first framework for supervised knowledge graph learning from
speech data. Our pipeline is straightforward: (1) constructing a KG from
transcribed spoken utterances and a named entity database, (2) converting the KG
into embedding vectors, and (3) training graph neural networks (GNNs) for node
classification and link prediction tasks. Through extensive experiments
conducted in inductive and transductive learning contexts using
state-of-the-art GNN models, we provide baseline results and error analysis for
node classification and link prediction tasks on human transcripts and
automatic speech recognition (ASR) transcripts, including evaluations using
both encoder-based and decoder-based node embeddings, as well as monolingual
and multilingual acoustic pre-trained models. All related code, data, and
models are published online.
comment: Preprint, 32 pages
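The three pipeline stages are easy to picture with a toy example. The sketch
below uses a hand-made entity list in plain PyTorch, with a single normalized
graph-convolution step standing in for the full GNN models and pretrained
text/acoustic embeddings used in the paper.

```python
# Toy walk-through of the three stages with a hand-made entity list; a
# single normalized graph-convolution step stands in for full GNN models.
import torch

utterances = ["call john tomorrow", "meet anna in berlin"]
entities = ["john", "anna", "berlin"]      # stand-in named entity database

# (1) Build the KG: utterance and entity nodes, linked by occurrence.
nodes = utterances + entities
idx = {n: i for i, n in enumerate(nodes)}
A = torch.eye(len(nodes))
for u in utterances:
    for e in entities:
        if e in u:
            A[idx[u], idx[e]] = A[idx[e], idx[u]] = 1.0

# (2) Node embeddings: random features stand in for text/acoustic encoders.
X = torch.randn(len(nodes), 16)

# (3) One graph-convolution step producing 2-class node logits.
deg = A.sum(dim=1, keepdim=True)
logits = torch.relu((A / deg) @ X @ torch.randn(16, 2))
print(logits.shape)                        # (num_nodes, 2)
```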
☆ mbrs: A Library for Minimum Bayes Risk Decoding
Minimum Bayes risk (MBR) decoding is a decision rule for text generation
tasks that outperforms conventional maximum a posteriori (MAP) decoding with
beam search by selecting high-quality outputs based on a utility function
rather than those with high probability. Typically, it finds the most suitable
hypothesis from a set of hypotheses under sampled pseudo-references. mbrs is a
library for MBR decoding that can flexibly combine various metrics,
alternative expectation estimators, and algorithmic variants. It is designed
with a focus on speed measurement and call counting of code blocks,
transparency, reproducibility, and extensibility, which are essential for
researchers and developers. We publish mbrs as an MIT-licensed open-source
project, and the code is available on GitHub.
GitHub: https://github.com/naist-nlp/mbrs
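The core decision rule is compact; the library generalizes the sketch below
with pluggable metrics and estimators. The unigram-F1 utility is a toy
stand-in for real metrics such as BLEU or COMET.

```python
# Sketch of the MBR decision rule: pick the hypothesis with the highest
# expected utility against sampled pseudo-references.
def mbr_decode(hypotheses, pseudo_references, utility):
    def expected_utility(h):
        return sum(utility(h, r) for r in pseudo_references) / len(pseudo_references)
    return max(hypotheses, key=expected_utility)

def unigram_f1(hyp, ref):                  # toy utility for illustration only
    h, r = set(hyp.split()), set(ref.split())
    overlap = len(h & r)
    if overlap == 0:
        return 0.0
    p, rec = overlap / len(h), overlap / len(r)
    return 2 * p * rec / (p + rec)

samples = ["the cat sat", "a cat sat down", "the dog ran"]
print(mbr_decode(samples, samples, unigram_f1))  # samples double as references
```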
☆ Semantics or spelling? Probing contextual word embeddings with orthographic noise
Pretrained language model (PLM) hidden states are frequently employed as
contextual word embeddings (CWE): high-dimensional representations that encode
semantic information given linguistic context. Across many areas of
computational linguistics research, similarity between CWEs is interpreted as
semantic similarity. However, it remains unclear exactly what information is
encoded in PLM hidden states. We investigate this practice by probing PLM
representations using minimal orthographic noise. We expect that if CWEs
primarily encode semantic information, a single character swap in the input
word will not drastically affect the resulting representation, given sufficient
linguistic context. Surprisingly, we find that CWEs generated by popular PLMs
are highly sensitive to noise in input data, and that this sensitivity is
related to subword tokenization: the fewer tokens used to represent a word at
input, the more sensitive its corresponding CWE. This suggests that CWEs
capture information unrelated to word-level meaning and can be manipulated
through trivial modifications of input data. We conclude that these PLM-derived
CWEs may not be reliable semantic proxies, and that caution is warranted when
interpreting representational similarity.
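The probe recipe is compact with standard tooling. A minimal sketch assuming
HuggingFace Transformers and bert-base-uncased (an illustrative model choice;
mean pooling over the sequence is a simplification of the paper's word-level
CWE extraction):

```python
# Sketch: compare clean vs. character-swapped inputs through a PLM.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(sentence: str) -> torch.Tensor:
    batch = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state   # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)            # simple mean pooling

clean = "the weather is beautiful today"
noised = "the weather is beuatiful today"           # one character swapped
sim = torch.cosine_similarity(embed(clean), embed(noised), dim=0)
print(f"cosine similarity: {sim.item():.3f}")
```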
☆ UNLEARN Efficient Removal of Knowledge in Large Language Models
Given the prevalence of large language models (LLMs) and the prohibitive cost
of training these models from scratch, dynamically forgetting specific
knowledge, e.g., private or proprietary information, without retraining the
model has become
an important capability. This paper proposes a novel method to achieve this
objective called UNLEARN. The approach builds upon subspace methods to identify
and specifically target the removal of knowledge without adversely affecting
other knowledge in the LLM. Results demonstrate 96% of targeted knowledge can
be forgotten while maintaining performance on other knowledge within 2.5% of
the original model, significantly outperforming the discriminatory abilities of
the previous state-of-the-art. A dual method called LEARN is also proposed for
targeted knowledge addition. Results show LEARN can match the fine-tuning
accuracy of Low-Rank Adaptation (LoRA) without adversely affecting similar
tasks.
comment: 11 pages, 2 Figures
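A heavily simplified sketch of the underlying subspace intuition, not the
paper's exact procedure: estimate the directions carrying the target
knowledge from activations, then project them out of a weight matrix.

```python
# Simplified subspace intuition: span the target-knowledge directions from
# activations, then project them out of a weight matrix.
import numpy as np

def remove_subspace(W: np.ndarray, target_acts: np.ndarray, rank: int = 2):
    """W: (d_out, d_in) weights; target_acts: (n, d_in) activations."""
    _, _, Vt = np.linalg.svd(target_acts, full_matrices=False)
    U = Vt[:rank].T                        # (d_in, rank) orthonormal basis
    P = np.eye(W.shape[1]) - U @ U.T       # projector onto the complement
    return W @ P                           # inputs along U no longer pass through

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
acts = rng.normal(size=(32, 16))           # activations elicited by target facts
print(np.linalg.norm(W - remove_subspace(W, acts)))  # nonzero: weights changed
```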
☆ Enhancing Healthcare through Large Language Models: A Study on Medical Question Answering
In recent years, the application of Large Language Models (LLMs) in
healthcare has shown significant promise in improving the accessibility and
dissemination of medical knowledge. This paper presents a detailed study of
various LLMs trained on the MedQuAD medical question-answering dataset, with a
focus on identifying the most effective model for providing accurate medical
information. Among the models tested, the Sentence-t5 combined with Mistral 7B
demonstrated superior performance, achieving a precision score of 0.762. This
model's enhanced capabilities are attributed to its advanced pretraining
techniques, robust architecture, and effective prompt construction
methodologies. By leveraging these strengths, the Sentence-t5 + Mistral 7B
model excels in understanding and generating precise medical answers. Our
findings highlight the potential of integrating sophisticated LLMs in medical
contexts to facilitate efficient and accurate medical knowledge retrieval, thus
significantly enhancing patient education and support.
comment: accepted by IEEE ICPICS
♻ ☆ Know Your Limits: A Survey of Abstention in Large Language Models
Abstention, the refusal of large language models (LLMs) to provide an answer,
is increasingly recognized for its potential to mitigate hallucinations and
enhance safety in LLM systems. In this survey, we introduce a framework to
examine abstention from three perspectives: the query, the model, and human
values. We organize the literature on abstention methods, benchmarks, and
evaluation metrics using this framework, and discuss merits and limitations of
prior work. We further identify and motivate areas for future work, centered
around whether abstention can be achieved as a meta-capability that transcends
specific tasks or domains, while still providing opportunities to optimize
abstention abilities based on context.
comment: preprint
♻ ☆ Self-Taught Evaluators
Tianlu Wang, Ilia Kulikov, Olga Golovneva, Ping Yu, Weizhe Yuan, Jane Dwivedi-Yu, Richard Yuanzhe Pang, Maryam Fazel-Zarandi, Jason Weston, Xian Li
Model-based evaluation is at the heart of successful model development -- as
a reward model for training, and as a replacement for human evaluation. To
train such evaluators, the standard approach is to collect a large amount of
human preference judgments over model responses, which is costly and the data
becomes stale as models improve. In this work, we present an approach that aims
to improve evaluators without human annotations, using synthetic training data
only. Starting from unlabeled instructions, our iterative self-improvement
scheme generates contrasting model outputs and trains an LLM-as-a-Judge to
produce reasoning traces and final judgments, repeating this training at each
new iteration using the improved predictions. Without any labeled preference
data, our Self-Taught Evaluator can improve a strong LLM (Llama3-70B-Instruct)
from 75.4 to 88.3 (88.7 with majority vote) on RewardBench. This outperforms
commonly used LLM judges such as GPT-4 and matches the performance of the
top-performing reward models trained with labeled examples.
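The loop structure is simple to express. A schematic sketch follows, in which
generate, judge, and train are assumed stand-ins for LLM sampling,
LLM-as-a-Judge inference, and fine-tuning; the toy lambdas exist only so the
loop runs end to end.

```python
# Schematic of the iterative loop; generate, judge, and train stand in for
# LLM sampling, LLM-as-a-Judge inference, and fine-tuning.
def self_taught_evaluator(instructions, generate, judge, train, iterations=3):
    model = None                                   # None = initial seed judge
    for _ in range(iterations):
        data = []
        for inst in instructions:
            better, worse = generate(inst)         # contrasting outputs
            trace, verdict = judge(model, inst, better, worse)
            if verdict == "first":                 # keep judgments matching construction
                data.append((inst, better, worse, trace))
        model = train(model, data)                 # retrain on its own traces
    return model

# Toy stand-ins so the loop runs end to end.
insts = ["summarize x", "translate y"]
gen = lambda i: (i + " [good]", i + " [bad]")
jdg = lambda m, i, a, b: ("reasoning...", "first")
trn = lambda m, d: {"trained_on": len(d)}
print(self_taught_evaluator(insts, gen, jdg, trn))
```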
♻ ☆ Automatic Generation of Behavioral Test Cases For Natural Language Processing Using Clustering and Prompting
Recent work in behavioral testing for natural language processing (NLP)
models, such as Checklist, is inspired by related paradigms in software
engineering testing. Such tests allow evaluation of general linguistic
capabilities and domain understanding, and hence can help assess conceptual
soundness and identify model weaknesses. However, a major challenge is the
creation of test cases. Current packages rely on a semi-automated approach
with manual development, which requires domain expertise and can be
time-consuming. This paper introduces an automated approach to developing test
cases by exploiting the power of large language models and statistical
techniques. It clusters text representations to carefully construct meaningful
groups and then applies prompting techniques to automatically generate Minimal
Functionality Tests
(MFT). The well-known Amazon Reviews corpus is used to demonstrate our
approach. We analyze the behavioral test profiles across four different
classification algorithms and discuss the limitations and strengths of those
models.
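A minimal sketch of the two steps, with embed and llm as assumed callables
for a sentence encoder and a chat model; the prompt wording is illustrative,
not the paper's template.

```python
# Sketch of the two steps; embed and llm are assumed callables for a
# sentence encoder and a chat model.
import numpy as np
from sklearn.cluster import KMeans

def generate_mfts(texts, embed, llm, n_clusters=5):
    X = np.vstack([embed(t) for t in texts])
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
    tests = {}
    for c in range(n_clusters):
        members = [t for t, lab in zip(texts, labels) if lab == c]
        prompt = ("These reviews share a theme:\n- " + "\n- ".join(members[:5])
                  + "\nWrite three short labeled test sentences probing "
                    "the same theme.")
        tests[c] = llm(prompt)                 # Minimal Functionality Tests
    return tests
```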
♻ ☆ Long and Short Guidance in Score identity Distillation for One-Step Text-to-Image Generation
Diffusion-based text-to-image generation models trained on extensive
text-image pairs have shown the capacity to generate photorealistic images
consistent with textual descriptions. However, a significant limitation of
these models is their slow sample generation, which requires iterative
refinement through the same network. In this paper, we enhance Score identity
Distillation (SiD) by developing long and short classifier-free guidance (LSG)
to efficiently distill pretrained Stable Diffusion models without using real
training data. SiD aims to optimize a model-based explicit score matching loss,
utilizing a score-identity-based approximation alongside the proposed LSG for
practical computation. By training exclusively with fake images synthesized
with its one-step generator, SiD equipped with LSG rapidly improves FID and
CLIP scores, achieving state-of-the-art FID performance while maintaining a
competitive CLIP score. Specifically, its data-free distillation of Stable
Diffusion 1.5 achieves a record low FID of 8.15 on the COCO-2014 validation
set, with a CLIP score of 0.304 at an LSG scale of 1.5, and an FID of 9.56 with
a CLIP score of 0.313 at an LSG scale of 2. Our code and distilled one-step
text-to-image generators are available at
https://github.com/mingyuanzhou/SiD-LSG.
comment: Code and model checkpoints available at
https://github.com/mingyuanzhou/SiD-LSG
♻ ☆ DocMath-Eval: Evaluating Math Reasoning Capabilities of LLMs in Understanding Long and Specialized Documents ACL 2024
Yilun Zhao, Yitao Long, Hongjun Liu, Ryo Kamoi, Linyong Nan, Lyuhao Chen, Yixin Liu, Xiangru Tang, Rui Zhang, Arman Cohan
Recent LLMs have demonstrated remarkable performance in solving exam-like
math word problems. However, the degree to which these numerical reasoning
skills are effective in real-world scenarios, particularly in expert domains,
is still largely unexplored. This paper introduces DocMath-Eval, a
comprehensive benchmark specifically designed to evaluate the numerical
reasoning capabilities of LLMs in the context of understanding and analyzing
specialized documents containing both text and tables. We evaluate a wide
spectrum of 48 LLMs with Chain-of-Thought and Program-of-Thought prompting
methods, aiming to comprehensively assess the capabilities and limitations of
existing LLMs in DocMath-Eval. We found that even the current best-performing
system (i.e., GPT-4o) still significantly lags behind human experts in solving
complex numerical reasoning problems grounded in long contexts. We believe that
DocMath-Eval can serve as a valuable benchmark for evaluating LLMs'
capabilities in solving challenging numerical reasoning problems within expert
domains.
comment: ACL 2024 Oral. arXiv admin note: text overlap with arXiv:2311.09797
♻ ☆ FinanceMath: Knowledge-Intensive Math Reasoning in Finance Domains ACL 2024
We introduce FinanceMath, a novel benchmark designed to evaluate LLMs'
capabilities in solving knowledge-intensive math reasoning problems. Compared
to prior works, this study features three core advancements. First, FinanceMath
includes 1,200 problems with a hybrid of textual and tabular content. These
problems require college-level knowledge in the finance domain for effective
resolution. Second, we provide expert-annotated, detailed solution references
in Python program format, ensuring a high-quality benchmark for LLM assessment.
We also construct a finance-domain knowledge bank and investigate various
knowledge integration strategies. Finally, we evaluate a wide spectrum of 44
LLMs with both Chain-of-Thought and Program-of-Thought prompting methods. Our
experimental results reveal that the current best-performing system (i.e.,
GPT-4o) achieves only 60.9% accuracy using CoT prompting, leaving substantial
room for improvement. Moreover, while augmenting LLMs with external knowledge
can improve model performance (e.g., from 47.5% to 54.5% for Gemini-1.5-Pro),
their accuracy remains significantly lower than the estimated human expert
performance of 92%. We believe that FinanceMath can advance future research in
the area of domain-specific knowledge retrieval and integration, particularly
within the context of solving reasoning-intensive tasks.
comment: ACL 2024 Oral
♻ ☆ Large Language Models are Capable of Offering Cognitive Reappraisal, if Guided
Large language models (LLMs) have offered new opportunities for emotional
support, and recent work has shown that they can produce empathic responses to
people in distress. However, long-term mental well-being requires emotional
self-regulation, where a one-time empathic response falls short. This work
takes a first step by engaging with cognitive reappraisals, a strategy from
psychology practitioners that uses language to change, in a targeted manner,
the negative appraisals that an individual makes of a situation; such
appraisals are known to sit at the root of human emotional experience. We
hypothesize that psychologically grounded principles could enable such
advanced psychology capabilities in LLMs, and design RESORT, which consists of
a series of
reappraisal constitutions across multiple dimensions that can be used as LLM
instructions. We conduct a first-of-its-kind expert evaluation (by clinical
psychologists with M.S. or Ph.D. degrees) of an LLM's zero-shot ability to
generate cognitive reappraisal responses to medium-length social media messages
asking for support. This fine-grained evaluation showed that even LLMs at the
7B scale guided by RESORT are capable of generating empathic responses that can
help users reappraise their situations.
comment: Accepted to COLM 2024
♻ ☆ An Autonomous GIS Agent Framework for Geospatial Data Retrieval
Powered by the emerging large language models (LLMs), autonomous geographic
information systems (GIS) agents have the potential to accomplish spatial
analyses and cartographic tasks. However, a research gap exists to support
fully autonomous GIS agents: how to enable agents to discover and download the
necessary data for geospatial analyses. This study proposes an autonomous GIS
agent framework capable of retrieving required geospatial data by generating,
executing, and debugging programs. The framework utilizes the LLM as the
decision-maker, selects the appropriate data source(s) from a pre-defined
source list, and fetches the data from the chosen source. Each data source has
a handbook that records the metadata and technical details for data retrieval.
The proposed framework is designed in a plug-and-play style to ensure
flexibility and extensibility. Human users or autonomous data crawlers can add
new data sources by adding new handbooks. We developed a prototype agent based
on the framework, released as a QGIS plugin (GeoData Retrieve Agent) and a
Python program. Experimental results demonstrate its capability of retrieving
data from various sources, including OpenStreetMap; administrative boundaries
and demographic data from the US Census Bureau; satellite basemaps from ESRI
World Imagery; global digital elevation models (DEM) from OpenTopography.org;
weather data from a commercial provider; and COVID-19 case data from the
NYTimes GitHub repository. Our study is among the first attempts to develop an
autonomous
geospatial data retrieval agent.
♻ ☆ Duwak: Dual Watermarks in Large Language Models
As large language models (LLM) are increasingly used for text generation
tasks, it is critical to audit their usages, govern their applications, and
mitigate their potential harms. Existing watermark techniques are shown
effective in embedding single human-imperceptible and machine-detectable
patterns without significantly affecting generated text quality and semantics.
However, the efficiency in detecting watermarks, i.e., the minimum number of
tokens required to assert detection with significance and robustness against
post-editing, is still debatable. In this paper, we propose Duwak to
fundamentally enhance the efficiency and quality of watermarking by embedding
dual secret patterns in both token probability distribution and sampling
schemes. To mitigate expression degradation caused by biasing toward certain
tokens, we design a contrastive search to watermark the sampling scheme, which
minimizes the token repetition and enhances the diversity. We theoretically
explain the interdependency of the two watermarks within Duwak. We evaluate
Duwak extensively on Llama2 under various post-editing attacks, against four
state-of-the-art watermarking techniques and combinations of them. Our results
show that Duwak-marked text achieves the highest watermarked text quality at
the lowest required token count for detection, using up to 70% fewer tokens
than existing approaches, especially under paraphrasing post-edits.
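Of the two embedded patterns, the token-probability one is the easier to
illustrate. The sketch below shows a generic keyed green-list logit bias of
the kind used in prior watermarking work (the mechanism class, not Duwak's
specific construction) and omits the contrastive-search sampling watermark
entirely.

```python
# Generic keyed green-list logit bias (the mechanism class, not Duwak's
# exact construction); the sampling-scheme watermark is omitted.
import hashlib
import numpy as np

def green_mask(prev_token: int, vocab: int, key: str, frac: float = 0.5):
    seed = int(hashlib.sha256(f"{key}:{prev_token}".encode()).hexdigest(), 16)
    rng = np.random.default_rng(seed % 2**32)      # keyed, reproducible split
    return rng.random(vocab) < frac                # True = "green" token

def watermarked_sample(logits, prev_token, key, delta=2.0):
    biased = logits + delta * green_mask(prev_token, len(logits), key)
    p = np.exp(biased - biased.max())
    p /= p.sum()
    return int(np.random.default_rng().choice(len(p), p=p))

print(watermarked_sample(np.zeros(10), prev_token=3, key="secret"))
```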
♻ ☆ Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, Soroosh Mariooryad, Yifan Ding, Xinyang Geng, Fred Alcober, Roy Frostig, Mark Omernick, Lexi Walker, Cosmin Paduraru, Christina Sorokin, Andrea Tacchetti, Colin Gaffney, Samira Daruki, Olcan Sercinoglu, Zach Gleicher, Juliette Love, Paul Voigtlaender, Rohan Jain, Gabriela Surita, Kareem Mohamed, Rory Blevins, Junwhan Ahn, Tao Zhu, Kornraphop Kawintiranon, Orhan Firat, Yiming Gu, Yujing Zhang, Matthew Rahtz, Manaal Faruqui, Natalie Clay, Justin Gilmer, JD Co-Reyes, Ivo Penchev, Rui Zhu, Nobuyuki Morioka, Kevin Hui, Krishna Haridasan, Victor Campos, Mahdis Mahdieh, Mandy Guo, Samer Hassan, Kevin Kilgour, Arpi Vezer, Heng-Tze Cheng, Raoul de Liedekerke, Siddharth Goyal, Paul Barham, DJ Strouse, Seb Noury, Jonas Adler, Mukund Sundararajan, Sharad Vikram, Dmitry Lepikhin, Michela Paganini, Xavier Garcia, Fan Yang, Dasha Valter, Maja Trebacz, Kiran Vodrahalli, Chulayuth Asawaroengchai, Roman Ring, Norbert Kalb, Livio Baldini Soares, Siddhartha Brahma, David Steiner, Tianhe Yu, Fabian Mentzer, Antoine He, Lucas Gonzalez, Bibo Xu, Raphael Lopez Kaufman, Laurent El Shafey, Junhyuk Oh, Tom Hennigan, George van den Driessche, Seth Odoom, Mario Lucic, Becca Roelofs, Sid Lall, Amit Marathe, Betty Chan, Santiago Ontanon, Luheng He, Denis Teplyashin, Jonathan Lai, Phil Crone, Bogdan Damoc, Lewis Ho, Sebastian Riedel, Karel Lenc, Chih-Kuan Yeh, Aakanksha Chowdhery, Yang Xu, Mehran Kazemi, Ehsan Amid, Anastasia Petrushkina, Kevin Swersky, Ali Khodaei, Gowoon Chen, Chris Larkin, Mario Pinto, Geng Yan, Adria Puigdomenech Badia, Piyush Patil, Steven Hansen, Dave Orr, Sebastien M. R. Arnold, Jordan Grimstad, Andrew Dai, Sholto Douglas, Rishika Sinha, Vikas Yadav, Xi Chen, Elena Gribovskaya, Jacob Austin, Jeffrey Zhao, Kaushal Patel, Paul Komarek, Sophia Austin, Sebastian Borgeaud, Linda Friso, Abhimanyu Goyal, Ben Caine, Kris Cao, Da-Woon Chung, Matthew Lamm, Gabe Barth-Maron, Thais Kagohara, Kate Olszewska, Mia Chen, Kaushik Shivakumar, Rishabh Agarwal, Harshal Godhia, Ravi Rajwar, Javier Snaider, Xerxes Dotiwalla, Yuan Liu, Aditya Barua, Victor Ungureanu, Yuan Zhang, Bat-Orgil Batsaikhan, Mateo Wirth, James Qin, Ivo Danihelka, Tulsee Doshi, Martin Chadwick, Jilin Chen, Sanil Jain, Quoc Le, Arjun Kar, Madhu Gurumurthy, Cheng Li, Ruoxin Sang, Fangyu Liu, Lampros Lamprou, Rich Munoz, Nathan Lintz, Harsh Mehta, Heidi Howard, Malcolm Reynolds, Lora Aroyo, Quan Wang, Lorenzo Blanco, Albin Cassirer, Jordan Griffith, Dipanjan Das, Stephan Lee, Jakub Sygnowski, Zach Fisher, James Besley, Richard Powell, Zafarali Ahmed, Dominik Paulus, David Reitter, Zalan Borsos, Rishabh Joshi, Aedan Pope, Steven Hand, Vittorio Selo, Vihan Jain, Nikhil Sethi, Megha Goel, Takaki Makino, Rhys May, Zhen Yang, Johan Schalkwyk, Christina Butterfield, Anja Hauth, Alex Goldin, Will Hawkins, Evan Senter, Sergey Brin, Oliver Woodman, Marvin Ritter, Eric Noland, Minh Giang, Vijay Bolina, Lisa Lee, Tim Blyth, Ian Mackinnon, Machel Reid, Obaid Sarvana, David Silver, Alexander Chen, Lily Wang, Loren Maggiore, Oscar Chang, Nithya Attaluri, Gregory Thornton, Chung-Cheng Chiu, Oskar Bunyan, Nir Levine, Timothy Chung, Evgenii Eltyshev, Xiance Si, Timothy Lillicrap, Demetra Brady, Vaibhav Aggarwal, Boxi Wu, Yuanzhong Xu, Ross McIlroy, Kartikeya Badola, Paramjit Sandhu, Erica Moreira, Wojciech Stokowiec, Ross Hemsley, Dong Li, Alex Tudor, Pranav Shyam, Elahe Rahimtoroghi, Salem Haykal, Pablo Sprechmann, Xiang Zhou, 
Diana Mincu, Yujia Li, Ravi Addanki, Kalpesh Krishna, Xiao Wu, Alexandre Frechette, Matan Eyal, Allan Dafoe, Dave Lacey, Jay Whang, Thi Avrahami, Ye Zhang, Emanuel Taropa, Hanzhao Lin, Daniel Toyama, Eliza Rutherford, Motoki Sano, HyunJeong Choe, Alex Tomala, Chalence Safranek-Shrader, Nora Kassner, Mantas Pajarskas, Matt Harvey, Sean Sechrist, Meire Fortunato, Christina Lyu, Gamaleldin Elsayed, Chenkai Kuang, James Lottes, Eric Chu, Chao Jia, Chih-Wei Chen, Peter Humphreys, Kate Baumli, Connie Tao, Rajkumar Samuel, Cicero Nogueira dos Santos, Anders Andreassen, Nemanja Rakićević, Dominik Grewe, Aviral Kumar, Stephanie Winkler, Jonathan Caton, Andrew Brock, Sid Dalmia, Hannah Sheahan, Iain Barr, Yingjie Miao, Paul Natsev, Jacob Devlin, Feryal Behbahani, Flavien Prost, Yanhua Sun, Artiom Myaskovsky, Thanumalayan Sankaranarayana Pillai, Dan Hurt, Angeliki Lazaridou, Xi Xiong, Ce Zheng, Fabio Pardo, Xiaowei Li, Dan Horgan, Joe Stanton, Moran Ambar, Fei Xia, Alejandro Lince, Mingqiu Wang, Basil Mustafa, Albert Webson, Hyo Lee, Rohan Anil, Martin Wicke, Timothy Dozat, Abhishek Sinha, Enrique Piqueras, Elahe Dabir, Shyam Upadhyay, Anudhyan Boral, Lisa Anne Hendricks, Corey Fry, Josip Djolonga, Yi Su, Jake Walker, Jane Labanowski, Ronny Huang, Vedant Misra, Jeremy Chen, RJ Skerry-Ryan, Avi Singh, Shruti Rijhwani, Dian Yu, Alex Castro-Ros, Beer Changpinyo, Romina Datta, Sumit Bagri, Arnar Mar Hrafnkelsson, Marcello Maggioni, Daniel Zheng, Yury Sulsky, Shaobo Hou, Tom Le Paine, Antoine Yang, Jason Riesa, Dominika Rogozinska, Dror Marcus, Dalia El Badawy, Qiao Zhang, Luyu Wang, Helen Miller, Jeremy Greer, Lars Lowe Sjos, Azade Nova, Heiga Zen, Rahma Chaabouni, Mihaela Rosca, Jiepu Jiang, Charlie Chen, Ruibo Liu, Tara Sainath, Maxim Krikun, Alex Polozov, Jean-Baptiste Lespiau, Josh Newlan, Zeyncep Cankara, Soo Kwak, Yunhan Xu, Phil Chen, Andy Coenen, Clemens Meyer, Katerina Tsihlas, Ada Ma, Juraj Gottweis, Jinwei Xing, Chenjie Gu, Jin Miao, Christian Frank, Zeynep Cankara, Sanjay Ganapathy, Ishita Dasgupta, Steph Hughes-Fitt, Heng Chen, David Reid, Keran Rong, Hongmin Fan, Joost van Amersfoort, Vincent Zhuang, Aaron Cohen, Shixiang Shane Gu, Anhad Mohananey, Anastasija Ilic, Taylor Tobin, John Wieting, Anna Bortsova, Phoebe Thacker, Emma Wang, Emily Caveness, Justin Chiu, Eren Sezener, Alex Kaskasoli, Steven Baker, Katie Millican, Mohamed Elhawaty, Kostas Aisopos, Carl Lebsack, Nathan Byrd, Hanjun Dai, Wenhao Jia, Matthew Wiethoff, Elnaz Davoodi, Albert Weston, Lakshman Yagati, Arun Ahuja, Isabel Gao, Golan Pundak, Susan Zhang, Michael Azzam, Khe Chai Sim, Sergi Caelles, James Keeling, Abhanshu Sharma, Andy Swing, YaGuang Li, Chenxi Liu, Carrie Grimes Bostock, Yamini Bansal, Zachary Nado, Ankesh Anand, Josh Lipschultz, Abhijit Karmarkar, Lev Proleev, Abe Ittycheriah, Soheil Hassas Yeganeh, George Polovets, Aleksandra Faust, Jiao Sun, Alban Rrustemi, Pen Li, Rakesh Shivanna, Jeremiah Liu, Chris Welty, Federico Lebron, Anirudh Baddepudi, Sebastian Krause, Emilio Parisotto, Radu Soricut, Zheng Xu, Dawn Bloxwich, Melvin Johnson, Behnam Neyshabur, Justin Mao-Jones, Renshen Wang, Vinay Ramasesh, Zaheer Abbas, Arthur Guez, Constant Segal, Duc Dung Nguyen, James Svensson, Le Hou, Sarah York, Kieran Milan, Sophie Bridgers, Wiktor Gworek, Marco Tagliasacchi, James Lee-Thorp, Michael Chang, Alexey Guseynov, Ale Jakse Hartman, Michael Kwong, Ruizhe Zhao, Sheleem Kashem, Elizabeth Cole, Antoine Miech, Richard Tanburn, Mary Phuong, Filip Pavetic, Sebastien Cevey, Ramona Comanescu, Richard Ives, Sherry Yang, Cosmo 
Du, Bo Li, Zizhao Zhang, Mariko Iinuma, Clara Huiyi Hu, Aurko Roy, Shaan Bijwadia, Zhenkai Zhu, Danilo Martins, Rachel Saputro, Anita Gergely, Steven Zheng, Dawei Jia, Ioannis Antonoglou, Adam Sadovsky, Shane Gu, Yingying Bi, Alek Andreev, Sina Samangooei, Mina Khan, Tomas Kocisky, Angelos Filos, Chintu Kumar, Colton Bishop, Adams Yu, Sarah Hodkinson, Sid Mittal, Premal Shah, Alexandre Moufarek, Yong Cheng, Adam Bloniarz, Jaehoon Lee, Pedram Pejman, Paul Michel, Stephen Spencer, Vladimir Feinberg, Xuehan Xiong, Nikolay Savinov, Charlotte Smith, Siamak Shakeri, Dustin Tran, Mary Chesus, Bernd Bohnet, George Tucker, Tamara von Glehn, Carrie Muir, Yiran Mao, Hideto Kazawa, Ambrose Slone, Kedar Soparkar, Disha Shrivastava, James Cobon-Kerr, Michael Sharman, Jay Pavagadhi, Carlos Araya, Karolis Misiunas, Nimesh Ghelani, Michael Laskin, David Barker, Qiujia Li, Anton Briukhov, Neil Houlsby, Mia Glaese, Balaji Lakshminarayanan, Nathan Schucher, Yunhao Tang, Eli Collins, Hyeontaek Lim, Fangxiaoyu Feng, Adria Recasens, Guangda Lai, Alberto Magni, Nicola De Cao, Aditya Siddhant, Zoe Ashwood, Jordi Orbay, Mostafa Dehghani, Jenny Brennan, Yifan He, Kelvin Xu, Yang Gao, Carl Saroufim, James Molloy, Xinyi Wu, Seb Arnold, Solomon Chang, Julian Schrittwieser, Elena Buchatskaya, Soroush Radpour, Martin Polacek, Skye Giordano, Ankur Bapna, Simon Tokumine, Vincent Hellendoorn, Thibault Sottiaux, Sarah Cogan, Aliaksei Severyn, Mohammad Saleh, Shantanu Thakoor, Laurent Shefey, Siyuan Qiao, Meenu Gaba, Shuo-yiin Chang, Craig Swanson, Biao Zhang, Benjamin Lee, Paul Kishan Rubenstein, Gan Song, Tom Kwiatkowski, Anna Koop, Ajay Kannan, David Kao, Parker Schuh, Axel Stjerngren, Golnaz Ghiasi, Gena Gibson, Luke Vilnis, Ye Yuan, Felipe Tiengo Ferreira, Aishwarya Kamath, Ted Klimenko, Ken Franko, Kefan Xiao, Indro Bhattacharya, Miteyan Patel, Rui Wang, Alex Morris, Robin Strudel, Vivek Sharma, Peter Choy, Sayed Hadi Hashemi, Jessica Landon, Mara Finkelstein, Priya Jhakra, Justin Frye, Megan Barnes, Matthew Mauger, Dennis Daun, Khuslen Baatarsukh, Matthew Tung, Wael Farhan, Henryk Michalewski, Fabio Viola, Felix de Chaumont Quitry, Charline Le Lan, Tom Hudson, Qingze Wang, Felix Fischer, Ivy Zheng, Elspeth White, Anca Dragan, Jean-baptiste Alayrac, Eric Ni, Alexander Pritzel, Adam Iwanicki, Michael Isard, Anna Bulanova, Lukas Zilka, Ethan Dyer, Devendra Sachan, Srivatsan Srinivasan, Hannah Muckenhirn, Honglong Cai, Amol Mandhane, Mukarram Tariq, Jack W. Rae, Gary Wang, Kareem Ayoub, Nicholas FitzGerald, Yao Zhao, Woohyun Han, Chris Alberti, Dan Garrette, Kashyap Krishnakumar, Mai Gimenez, Anselm Levskaya, Daniel Sohn, Josip Matak, Inaki Iturrate, Michael B. 
Chang, Jackie Xiang, Yuan Cao, Nishant Ranka, Geoff Brown, Adrian Hutter, Vahab Mirrokni, Nanxin Chen, Kaisheng Yao, Zoltan Egyed, Francois Galilee, Tyler Liechty, Praveen Kallakuri, Evan Palmer, Sanjay Ghemawat, Jasmine Liu, David Tao, Chloe Thornton, Tim Green, Mimi Jasarevic, Sharon Lin, Victor Cotruta, Yi-Xuan Tan, Noah Fiedel, Hongkun Yu, Ed Chi, Alexander Neitz, Jens Heitkaemper, Anu Sinha, Denny Zhou, Yi Sun, Charbel Kaed, Brice Hulse, Swaroop Mishra, Maria Georgaki, Sneha Kudugunta, Clement Farabet, Izhak Shafran, Daniel Vlasic, Anton Tsitsulin, Rajagopal Ananthanarayanan, Alen Carin, Guolong Su, Pei Sun, Shashank V, Gabriel Carvajal, Josef Broder, Iulia Comsa, Alena Repina, William Wong, Warren Weilun Chen, Peter Hawkins, Egor Filonov, Lucia Loher, Christoph Hirnschall, Weiyi Wang, Jingchen Ye, Andrea Burns, Hardie Cate, Diana Gage Wright, Federico Piccinini, Lei Zhang, Chu-Cheng Lin, Ionel Gog, Yana Kulizhskaya, Ashwin Sreevatsa, Shuang Song, Luis C. Cobo, Anand Iyer, Chetan Tekur, Guillermo Garrido, Zhuyun Xiao, Rupert Kemp, Huaixiu Steven Zheng, Hui Li, Ananth Agarwal, Christel Ngani, Kati Goshvadi, Rebeca Santamaria-Fernandez, Wojciech Fica, Xinyun Chen, Chris Gorgolewski, Sean Sun, Roopal Garg, Xinyu Ye, S. M. Ali Eslami, Nan Hua, Jon Simon, Pratik Joshi, Yelin Kim, Ian Tenney, Sahitya Potluri, Lam Nguyen Thiet, Quan Yuan, Florian Luisier, Alexandra Chronopoulou, Salvatore Scellato, Praveen Srinivasan, Minmin Chen, Vinod Koverkathu, Valentin Dalibard, Yaming Xu, Brennan Saeta, Keith Anderson, Thibault Sellam, Nick Fernando, Fantine Huot, Junehyuk Jung, Mani Varadarajan, Michael Quinn, Amit Raul, Maigo Le, Ruslan Habalov, Jon Clark, Komal Jalan, Kalesha Bullard, Achintya Singhal, Thang Luong, Boyu Wang, Sujeevan Rajayogam, Julian Eisenschlos, Johnson Jia, Daniel Finchelstein, Alex Yakubovich, Daniel Balle, Michael Fink, Sameer Agarwal, Jing Li, Dj Dvijotham, Shalini Pal, Kai Kang, Jaclyn Konzelmann, Jennifer Beattie, Olivier Dousse, Diane Wu, Remi Crocker, Chen Elkind, Siddhartha Reddy Jonnalagadda, Jong Lee, Dan Holtmann-Rice, Krystal Kallarackal, Rosanne Liu, Denis Vnukov, Neera Vats, Luca Invernizzi, Mohsen Jafari, Huanjie Zhou, Lilly Taylor, Jennifer Prendki, Marcus Wu, Tom Eccles, Tianqi Liu, Kavya Kopparapu, Francoise Beaufays, Christof Angermueller, Andreea Marzoca, Shourya Sarcar, Hilal Dib, Jeff Stanway, Frank Perbet, Nejc Trdin, Rachel Sterneck, Andrey Khorlin, Dinghua Li, Xihui Wu, Sonam Goenka, David Madras, Sasha Goldshtein, Willi Gierke, Tong Zhou, Yaxin Liu, Yannie Liang, Anais White, Yunjie Li, Shreya Singh, Sanaz Bahargam, Mark Epstein, Sujoy Basu, Li Lao, Adnan Ozturel, Carl Crous, Alex Zhai, Han Lu, Zora Tung, Neeraj Gaur, Alanna Walton, Lucas Dixon, Ming Zhang, Amir Globerson, Grant Uy, Andrew Bolt, Olivia Wiles, Milad Nasr, Ilia Shumailov, Marco Selvi, Francesco Piccinno, Ricardo Aguilar, Sara McCarthy, Misha Khalman, Mrinal Shukla, Vlado Galic, John Carpenter, Kevin Villela, Haibin Zhang, Harry Richardson, James Martens, Matko Bosnjak, Shreyas Rammohan Belle, Jeff Seibert, Mahmoud Alnahlawi, Brian McWilliams, Sankalp Singh, Annie Louis, Wen Ding, Dan Popovici, Lenin Simicich, Laura Knight, Pulkit Mehta, Nishesh Gupta, Chongyang Shi, Saaber Fatehi, Jovana Mitrovic, Alex Grills, Joseph Pagadora, Dessie Petrova, Danielle Eisenbud, Zhishuai Zhang, Damion Yates, Bhavishya Mittal, Nilesh Tripuraneni, Yannis Assael, Thomas Brovelli, Prateek Jain, Mihajlo Velimirovic, Canfer Akbulut, Jiaqi Mu, Wolfgang Macherey, Ravin Kumar, Jun Xu, Haroon Qureshi, Gheorghe 
Comanici, Jeremy Wiesner, Zhitao Gong, Anton Ruddock, Matthias Bauer, Nick Felt, Anirudh GP, Anurag Arnab, Dustin Zelle, Jonas Rothfuss, Bill Rosgen, Ashish Shenoy, Bryan Seybold, Xinjian Li, Jayaram Mudigonda, Goker Erdogan, Jiawei Xia, Jiri Simsa, Andrea Michi, Yi Yao, Christopher Yew, Steven Kan, Isaac Caswell, Carey Radebaugh, Andre Elisseeff, Pedro Valenzuela, Kay McKinney, Kim Paterson, Albert Cui, Eri Latorre-Chimoto, Solomon Kim, William Zeng, Ken Durden, Priya Ponnapalli, Tiberiu Sosea, Christopher A. Choquette-Choo, James Manyika, Brona Robenek, Harsha Vashisht, Sebastien Pereira, Hoi Lam, Marko Velic, Denese Owusu-Afriyie, Katherine Lee, Tolga Bolukbasi, Alicia Parrish, Shawn Lu, Jane Park, Balaji Venkatraman, Alice Talbert, Lambert Rosique, Yuchung Cheng, Andrei Sozanschi, Adam Paszke, Praveen Kumar, Jessica Austin, Lu Li, Khalid Salama, Wooyeol Kim, Nandita Dukkipati, Anthony Baryshnikov, Christos Kaplanis, XiangHai Sheng, Yuri Chervonyi, Caglar Unlu, Diego de Las Casas, Harry Askham, Kathryn Tunyasuvunakool, Felix Gimeno, Siim Poder, Chester Kwak, Matt Miecnikowski, Vahab Mirrokni, Alek Dimitriev, Aaron Parisi, Dangyi Liu, Tomy Tsai, Toby Shevlane, Christina Kouridi, Drew Garmon, Adrian Goedeckemeyer, Adam R. Brown, Anitha Vijayakumar, Ali Elqursh, Sadegh Jazayeri, Jin Huang, Sara Mc Carthy, Jay Hoover, Lucy Kim, Sandeep Kumar, Wei Chen, Courtney Biles, Garrett Bingham, Evan Rosen, Lisa Wang, Qijun Tan, David Engel, Francesco Pongetti, Dario de Cesare, Dongseong Hwang, Lily Yu, Jennifer Pullman, Srini Narayanan, Kyle Levin, Siddharth Gopal, Megan Li, Asaf Aharoni, Trieu Trinh, Jessica Lo, Norman Casagrande, Roopali Vij, Loic Matthey, Bramandia Ramadhana, Austin Matthews, CJ Carey, Matthew Johnson, Kremena Goranova, Rohin Shah, Shereen Ashraf, Kingshuk Dasgupta, Rasmus Larsen, Yicheng Wang, Manish Reddy Vuyyuru, Chong Jiang, Joana Ijazi, Kazuki Osawa, Celine Smith, Ramya Sree Boppana, Taylan Bilal, Yuma Koizumi, Ying Xu, Yasemin Altun, Nir Shabat, Ben Bariach, Alex Korchemniy, Kiam Choo, Olaf Ronneberger, Chimezie Iwuanyanwu, Shubin Zhao, David Soergel, Cho-Jui Hsieh, Irene Cai, Shariq Iqbal, Martin Sundermeyer, Zhe Chen, Elie Bursztein, Chaitanya Malaviya, Fadi Biadsy, Prakash Shroff, Inderjit Dhillon, Tejasi Latkar, Chris Dyer, Hannah Forbes, Massimo Nicosia, Vitaly Nikolaev, Somer Greene, Marin Georgiev, Pidong Wang, Nina Martin, Hanie Sedghi, John Zhang, Praseem Banzal, Doug Fritz, Vikram Rao, Xuezhi Wang, Jiageng Zhang, Viorica Patraucean, Dayou Du, Igor Mordatch, Ivan Jurin, Lewis Liu, Ayush Dubey, Abhi Mohan, Janek Nowakowski, Vlad-Doru Ion, Nan Wei, Reiko Tojo, Maria Abi Raad, Drew A. Hudson, Vaishakh Keshava, Shubham Agrawal, Kevin Ramirez, Zhichun Wu, Hoang Nguyen, Ji Liu, Madhavi Sewak, Bryce Petrini, DongHyun Choi, Ivan Philips, Ziyue Wang, Ioana Bica, Ankush Garg, Jarek Wilkiewicz, Priyanka Agrawal, Xiaowei Li, Danhao Guo, Emily Xue, Naseer Shaik, Andrew Leach, Sadh MNM Khan, Julia Wiesinger, Sammy Jerome, Abhishek Chakladar, Alek Wenjiao Wang, Tina Ornduff, Folake Abu, Alireza Ghaffarkhah, Marcus Wainwright, Mario Cortes, Frederick Liu, Joshua Maynez, Andreas Terzis, Pouya Samangouei, Riham Mansour, Tomasz Kępa, François-Xavier Aubet, Anton Algymr, Dan Banica, Agoston Weisz, Andras Orban, Alexandre Senges, Ewa Andrejczuk, Mark Geller, Niccolo Dal Santo, Valentin Anklin, Majd Al Merey, Martin Baeuml, Trevor Strohman, Junwen Bai, Slav Petrov, Yonghui Wu, Demis Hassabis, Koray Kavukcuoglu, Jeffrey Dean, Oriol Vinyals
In this report, we introduce the Gemini 1.5 family of models, representing
the next generation of highly compute-efficient multimodal models capable of
recalling and reasoning over fine-grained information from millions of tokens
of context, including multiple long documents and hours of video and audio. The
family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds
the February version on the great majority of capabilities and benchmarks; (2)
Gemini 1.5 Flash, a more lightweight variant designed for efficiency with
minimal regression in quality. Gemini 1.5 models achieve near-perfect recall on
long-context retrieval tasks across modalities, improve the state-of-the-art in
long-document QA, long-video QA and long-context ASR, and match or surpass
Gemini 1.0 Ultra's state-of-the-art performance across a broad set of
benchmarks. Studying the limits of Gemini 1.5's long-context ability, we find
continued improvement in next-token prediction and near-perfect retrieval
(>99%) up to at least 10M tokens, a generational leap over existing models such
as Claude 3.0 (200k) and GPT-4 Turbo (128k). Finally, we highlight real-world
use cases, such as Gemini 1.5 collaborating with professionals on completing
their tasks, achieving 26 to 75% time savings across 10 different job
categories, as well as surprising new capabilities of large language models at
the frontier; when given a grammar manual for Kalamang, a language with fewer
than 200 speakers worldwide, the model learns to translate English to Kalamang
at a similar level to a person who learned from the same content.
♻ ☆ Scalable Model Editing via Customized Expert Networks
Addressing the issues of hallucinations and outdated knowledge in large
language models is critical for their reliable application. Model Editing
presents a promising avenue for mitigating these challenges in a cost-effective
manner. However, existing methods often suffer from unsatisfactory
generalization and unintended effects on non-edited samples. To overcome these
limitations, we introduce a novel approach: Scalable Model Editing via
Customized Expert Networks (SCEN), which is a two-stage continuous training
paradigm. Specifically, in the first stage, we train lightweight expert
networks individually for each piece of knowledge that needs to be updated.
Subsequently, we train a corresponding indexing neuron for each expert to
control the activation state of that expert. We conducted a series of
experiments on the ZsRE and Hallucination benchmarks by tuning the advanced
open-source LLM, Llama2, achieving state-of-the-art results compared to current
mainstream methods. Our code is available at
https://github.com/TAL-auroraX/SCEN.
comment: Accepted by COLM 2024
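A schematic of the expert-plus-indexing-neuron design as described; the
dimensions, sigmoid gating, and additive composition below are illustrative
assumptions, not the paper's exact architecture.

```python
# Schematic of per-fact experts gated by scalar indexing neurons; sizes,
# sigmoid gating, and additive composition are illustrative assumptions.
import torch
import torch.nn as nn

class IndexedExperts(nn.Module):
    def __init__(self, dim: int, n_experts: int):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))
        self.indexers = nn.ModuleList(nn.Linear(dim, 1) for _ in range(n_experts))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        out = h
        for expert, indexer in zip(self.experts, self.indexers):
            gate = torch.sigmoid(indexer(h))   # ~1 only for the matching edit
            out = out + gate * expert(h)       # expert rewrites the hidden state
        return out

print(IndexedExperts(dim=32, n_experts=4)(torch.randn(2, 32)).shape)
```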
♻ ☆ Research Trends for the Interplay between Large Language Models and Knowledge Graphs
Hanieh Khorashadizadeh, Fatima Zahra Amara, Morteza Ezzabady, Frédéric Ieng, Sanju Tiwari, Nandana Mihindukulasooriya, Jinghua Groppe, Soror Sahri, Farah Benamara, Sven Groppe
This survey investigates the synergistic relationship between Large Language
Models (LLMs) and Knowledge Graphs (KGs), which is crucial for advancing AI's
capabilities in understanding, reasoning, and language processing. It aims to
address gaps in current research by exploring areas such as KG Question
Answering, ontology generation, KG validation, and the enhancement of KG
accuracy and consistency through LLMs. The paper further examines the roles of
LLMs in generating descriptive texts and natural language queries for KGs.
Through a structured analysis that includes categorizing LLM-KG interactions,
examining methodologies, and investigating collaborative uses and potential
biases, this study seeks to provide new insights into the combined potential of
LLMs and KGs. It highlights the importance of their interaction for improving
AI applications and outlines future research directions.
♻ ☆ The Use of Large Language Models (LLM) for Cyber Threat Intelligence (CTI) in Cybercrime Forums
Vanessa Clairoux-Trepanier, Isa-May Beauchamp, Estelle Ruellan, Masarah Paquet-Clouston, Serge-Olivier Paquette, Eric Clay
Large language models (LLMs) can be used to analyze cyber threat intelligence
(CTI) data from cybercrime forums, which contain extensive information and key
discussions about emerging cyber threats. However, to date, the level of
accuracy and efficiency of LLMs for such critical tasks has yet to be
thoroughly evaluated. Hence, this study assesses the accuracy of an LLM system
built on the OpenAI GPT-3.5-turbo model [7] to extract CTI information. To do
so, a random sample of 500 daily conversations from three cybercrime forums,
XSS, Exploit_in, and RAMP, was extracted, and the LLM system was instructed to
summarize the conversations and code 10 key CTI variables, such as whether a
large organization and/or a critical infrastructure is being targeted. Then,
two coders reviewed each conversation and evaluated whether the information
extracted by the LLM was accurate. The LLM system performed strikingly well,
with an average accuracy score of 98%. Various ways to enhance the model were
uncovered, such as the need to help the LLM distinguish between stories and
past events, as well as being careful with verb tenses in prompts.
Nevertheless, the results of this study highlight the efficiency and relevance
of using LLMs for cyber threat intelligence.
♻ ☆ It Couldn't Help But Overhear: On the Limits of Modelling Meta-Communicative Grounding Acts with Supervised Learning
Active participation in a conversation is key to building common ground,
since understanding is jointly tailored by producers and recipients.
Overhearers are deprived of the privilege of performing grounding acts and can
only conjecture about intended meanings. Still, data generation and annotation,
modelling, training and evaluation of NLP dialogue models place reliance on the
overhearing paradigm. How much of the underlying grounding processes are
thereby forfeited? As we show, there is evidence pointing to the impossibility
of properly modelling human meta-communicative acts with data-driven learning
models. In this paper, we discuss this issue and provide a preliminary analysis
on the variability of human decisions for requesting clarification. Most
importantly, we wish to bring this topic back to the community's table,
encouraging discussion on the consequences of having models designed to only
"listen in".
comment: Accepted to SIGdial 2024
♻ ☆ U2++ MoE: Scaling 4.7x parameters with minimal impact on RTF
Scale has opened new frontiers in natural language processing, but at a high
cost. In response, Mixture-of-Experts (MoE) models, which learn to activate
only a subset of parameters during training and inference, have been proposed
as an energy-efficient path to even larger and more capable language models.
This shift towards a new generation of foundation models is gaining momentum,
particularly within the field of Automatic Speech Recognition (ASR). Recent
works incorporating MoE into ASR models have complex designs, such as routing
frames via a supplementary embedding network, improving the multilingual
ability of the experts, and utilizing dedicated auxiliary losses for either
expert load balancing or specific language handling. We find that such
delicate designs are not necessary; an embarrassingly simple substitution of
MoE layers for all Feed-Forward Network (FFN) layers suffices for the ASR
task. To be more specific, we benchmark our proposed model on a large-scale
inner-source dataset (160k hours); the results show that we can scale our
baseline Conformer (Dense-225M) to its MoE counterpart (MoE-1B) and achieve
Dense-1B-level Word Error Rate (WER) while maintaining a Dense-225M-level Real
Time Factor (RTF). Furthermore, by applying the Unified 2-pass framework with
bidirectional attention decoders (U2++), we achieve the streaming and
non-streaming decoding modes in a single MoE based model, which we call U2++
MoE. We hope that our study can facilitate the research on scaling speech
foundation models without sacrificing deployment efficiency.
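The substitution itself is mechanical. Below is a minimal sparse-MoE layer of
the kind that can replace each FFN block; top-1 routing and the dimensions
are illustrative simplifications, not the paper's configuration.

```python
# Minimal sparse-MoE layer of the kind that can replace each FFN block
# (top-1 routing for brevity; dimensions are illustrative).
import torch
import torch.nn as nn

class MoEFFN(nn.Module):
    def __init__(self, dim: int, hidden: int, n_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (tokens, dim)
        choice = self.router(x).argmax(dim=-1)             # top-1 expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = choice == i
            if mask.any():                                 # only active experts compute
                out[mask] = expert(x[mask])
        return out

print(MoEFFN(dim=64, hidden=256)(torch.randn(10, 64)).shape)
```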
♻ ☆ Rejection Improves Reliability: Training LLMs to Refuse Unknown Questions Using RL from Knowledge Feedback
Large Language Models (LLMs) often generate erroneous outputs, known as
hallucinations, due to their limitations in discerning questions beyond their
knowledge scope. While addressing hallucination has been a focal point in
research, previous efforts primarily concentrate on enhancing correctness
without giving due consideration to the significance of rejection mechanisms.
In this paper, we conduct a comprehensive examination of the role of rejection,
introducing the notion of model reliability along with corresponding metrics.
These metrics measure the model's ability to provide accurate responses while
adeptly rejecting questions exceeding its knowledge boundaries, thereby
minimizing hallucinations. To improve the inherent reliability of LLMs, we
present a novel alignment framework called Reinforcement Learning from
Knowledge Feedback (RLKF). RLKF leverages knowledge feedback to dynamically
determine the model's knowledge boundary and trains a reliable reward model to
encourage the refusal of out-of-knowledge questions. Experimental results on
mathematical questions affirm the substantial efficacy of RLKF in significantly
enhancing LLM reliability.
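The reliability objective can be toy-sketched as reward shaping: reward
correct answers, reward refusals only on out-of-knowledge questions, and
penalize hallucinated answers most. The numeric values below are illustrative,
not the paper's.

```python
# Toy reward shaping consistent with the stated goal; the numeric values
# are illustrative, not the paper's.
def reliability_reward(answer: str, gold: str, in_knowledge: bool) -> float:
    refused = answer.strip().lower().startswith("i don't know")
    if refused:
        return 0.5 if not in_knowledge else -0.2  # refusal good only when warranted
    return 1.0 if answer == gold else -1.0        # hallucinations cost the most

print(reliability_reward("I don't know.", "42", in_knowledge=False))  # 0.5
```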
♻ ☆ Learning Domain-Invariant Features for Out-of-Context News Detection
Out-of-context news is a common type of misinformation on online media
platforms. It involves posting a caption alongside a mismatched news image.
Existing out-of-context news detection models only consider the scenario where
pre-labeled data is available for each domain, failing to address
out-of-context news detection on unlabeled domains (e.g., news topics or
agencies). In this work, we therefore focus on domain-adaptive out-of-context
news detection. In order to effectively adapt the detection model to unlabeled
news topics or agencies, we propose ConDA-TTA (Contrastive Domain Adaptation
with Test-Time Adaptation) which applies contrastive learning and maximum mean
discrepancy (MMD) to learn domain-invariant features. In addition, we leverage
test-time target domain statistics to further assist domain adaptation.
Experimental results show that our approach outperforms baselines in most
domain adaptation settings on two public datasets, by as much as 2.93% in F1
and 2.08% in accuracy.
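The MMD statistic that ConDA-TTA minimizes is compact to compute. A sketch of
an RBF-kernel estimator follows; the fixed bandwidth and the biased form
(diagonal terms included) are simplifications for brevity.

```python
# Biased, fixed-bandwidth RBF-kernel MMD between source and target features.
import torch

def mmd_rbf(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    def k(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

src = torch.randn(64, 128)                 # labeled source-domain features
tgt = torch.randn(64, 128) + 0.5           # shifted unlabeled target features
print(mmd_rbf(src, tgt).item())            # shrinks as the domains align
```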
♻ ☆ Machine Psychology
Thilo Hagendorff, Ishita Dasgupta, Marcel Binz, Stephanie C. Y. Chan, Andrew Lampinen, Jane X. Wang, Zeynep Akata, Eric Schulz
Large language models (LLMs) show increasingly advanced emergent capabilities
and are being incorporated across various societal domains. Understanding their
behavior and reasoning abilities therefore holds significant importance. We
argue that a fruitful direction for research is engaging LLMs in behavioral
experiments inspired by psychology that have traditionally been aimed at
understanding human cognition and behavior. In this article, we highlight and
summarize theoretical perspectives, experimental paradigms, and computational
analysis techniques that this approach brings to the table. It paves the way
for a "machine psychology" for generative artificial intelligence (AI) that
goes beyond performance benchmarks and focuses instead on computational
insights that move us toward a better understanding and discovery of emergent
abilities and behavioral patterns in LLMs. We review existing work taking this
approach, synthesize best practices, and highlight promising future directions.
We also highlight the important caveats of applying methodologies designed for
understanding humans to machines. We posit that leveraging tools from
experimental psychology to study AI will become increasingly valuable as models
evolve to be more powerful, opaque, multi-modal, and integrated into complex
real-world settings.
♻ ☆ A Survey on Mixture of Experts
Large language models (LLMs) have garnered unprecedented advancements across
diverse fields, ranging from natural language processing to computer vision and
beyond. The prowess of LLMs is underpinned by their substantial model size,
extensive and diverse datasets, and the vast computational power harnessed
during training, all of which contribute to the emergent abilities of LLMs
(e.g., in-context learning) that are not present in small models. Within this
context, the mixture of experts (MoE) has emerged as an effective method for
substantially scaling up model capacity with minimal computation overhead,
gaining significant attention from academia and industry. Despite its growing
prevalence, the literature on MoE still lacks a systematic and comprehensive
review. This survey seeks to bridge that gap, serving as an essential resource
for researchers delving into the intricacies of MoE. We first briefly introduce
the structure of the MoE layer, followed by proposing a new taxonomy of MoE.
Next, we overview the core designs for various MoE models including both
algorithmic and systemic aspects, alongside collections of available
open-source implementations, hyperparameter configurations and empirical
evaluations. Furthermore, we delineate the multifaceted applications of MoE in
practice, and outline some potential directions for future research. To
facilitate ongoing updates and the sharing of cutting-edge developments in MoE
research, we have established a resource repository accessible at
https://github.com/withinmiaov/A-Survey-on-Mixture-of-Experts.
♻ ☆ Empowering Large Language Model Agents through Action Learning
Haiteng Zhao, Chang Ma, Guoyin Wang, Jing Su, Lingpeng Kong, Jingjing Xu, Zhi-Hong Deng, Hongxia Yang
Large Language Model (LLM) Agents have recently garnered increasing interest,
yet they are limited in their ability to learn from trial and error, a key
element of intelligent behavior. In this work, we argue that the capacity to
learn new actions from experience is fundamental to the advancement of learning
in LLM agents. While humans naturally expand their action spaces and develop
skills through experiential learning, LLM agents typically operate within fixed
action spaces, limiting their potential for growth. To address these
challenges, our study explores open-action learning for language agents. We
introduce a framework LearnAct with an iterative learning strategy to create
and improve actions in the form of Python functions. In each iteration, the
LLM revises and updates the currently available actions based on the errors
identified in unsuccessful training tasks, thereby enhancing action
effectiveness. Our experimental evaluations across Robotic Planning and
AlfWorld environments reveal that after learning on a few training task
instances, our approach to open-action learning markedly improves agent
performance for the type of task (by 32 percent in AlfWorld compared to
ReAct+Reflexion, for instance), highlighting the importance of experiential
action learning in the development of more intelligent LLM agents.
comment: 9 pages
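The iterative loop reads naturally as pseudocode. A schematic sketch follows,
in which run_agent and llm are assumed callables: the agent runtime, and an
LLM that returns (name, source) pairs of revised Python functions.

```python
# Schematic of the loop; run_agent and llm are assumed callables (the agent
# runtime, and an LLM returning (name, source) pairs of revised functions).
def learn_actions(tasks, run_agent, llm, iterations=3):
    actions = {}                                   # name -> Python source
    for _ in range(iterations):
        failures = [t for t in tasks if not run_agent(t, actions)]
        if not failures:
            break
        prompt = ("Current actions:\n" + "\n".join(actions.values())
                  + "\nFailed tasks:\n" + "\n".join(map(str, failures))
                  + "\nRevise or add Python functions to fix these failures.")
        for name, source in llm(prompt):
            actions[name] = source                 # agent runtime exec()s these
    return actions
```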
♻ ☆ MANGO: A Benchmark for Evaluating Mapping and Navigation Abilities of Large Language Models
Peng Ding, Jiading Fang, Peng Li, Kangrui Wang, Xiaochen Zhou, Mo Yu, Jing Li, Matthew R. Walter, Hongyuan Mei
Large language models such as ChatGPT and GPT-4 have recently achieved
astonishing performance on a variety of natural language processing tasks. In
this paper, we propose MANGO, a benchmark to evaluate their capabilities to
perform text-based mapping and navigation. Our benchmark includes 53 mazes
taken from a suite of textgames: each maze is paired with a walkthrough that
visits every location but does not cover all possible paths. The task is
question-answering: for each maze, a large language model reads the walkthrough
and answers hundreds of mapping and navigation questions such as "How should
you go to Attic from West of House?" and "Where are we if we go north and east
from Cellar?". Although these questions are easy to humans, it turns out that
even GPT-4, the best-to-date language model, performs poorly at answering them.
Further, our experiments suggest that a strong mapping and navigation ability
would benefit large language models in performing relevant downstream tasks,
such as playing textgames. Our MANGO benchmark will facilitate future research
on methods that improve the mapping and navigation capabilities of language
models. We host our leaderboard, data, code, and evaluation program at
https://mango.ttic.edu and https://github.com/oaklight/mango/.
comment: COLM 2024 camera-ready
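For orientation, the questions reduce to graph queries once a walkthrough is
parsed into (place, direction) -> place edges; a toy breadth-first search
over hand-made edges (not MANGO data) answers the route question.

```python
# Toy edges (not MANGO data): walkthrough parsed into (place, move) -> place.
from collections import deque

edges = {("West of House", "north"): "North of House",
         ("North of House", "east"): "Behind House",
         ("Behind House", "up"): "Attic"}

def route(start, goal):
    graph = {}
    for (src, move), dst in edges.items():
        graph.setdefault(src, []).append((move, dst))
    queue, seen = deque([(start, [])]), {start}
    while queue:
        place, path = queue.popleft()
        if place == goal:
            return path
        for move, nxt in graph.get(place, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [move]))

print(route("West of House", "Attic"))     # ['north', 'east', 'up']
```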
♻ ☆ TarGEN: Targeted Data Generation with Large Language Models
Himanshu Gupta, Kevin Scaria, Ujjwala Anantheswaran, Shreyas Verma, Mihir Parmar, Saurabh Arjun Sawant, Chitta Baral, Swaroop Mishra
The rapid advancement of large language models (LLMs) has sparked interest in
data synthesis techniques, aiming to generate diverse and high-quality
synthetic datasets. However, these synthetic datasets often suffer from a lack
of diversity and added noise. In this paper, we present TarGEN, a multi-step
prompting strategy for generating high-quality synthetic datasets utilizing
an LLM. An advantage of TarGEN is its seedless nature; it does not require
specific task instances, broadening its applicability beyond task replication.
We augment TarGEN with a method known as self-correction, empowering LLMs to
rectify inaccurately labeled instances during dataset creation, ensuring
reliable labels. To assess our technique's effectiveness, we emulate 8 tasks
from the SuperGLUE benchmark and finetune various language models, including
encoder-only, encoder-decoder, and decoder-only models on both synthetic and
original training sets. Evaluation on the original test set reveals that models
trained on datasets generated by TarGEN perform approximately 1-2% points
better than those trained on original datasets (82.84% on synthetic vs.
81.12% on original data using Flan-T5). When incorporating instruction tuning,
the performance
increases to 84.54% on synthetic data vs. 81.49% on original data by Flan-T5. A
comprehensive analysis of the synthetic dataset compared to the original
dataset reveals that the synthetic dataset demonstrates similar or higher
levels of dataset complexity and diversity. Furthermore, the synthetic dataset
displays a bias level that aligns closely with the original dataset. Finally,
when pre-finetuned on our synthetic SuperGLUE dataset, T5-3B yields impressive
results on the OpenLLM leaderboard, surpassing the model trained on the
Self-Instruct dataset by 4.14% points. We hope that TarGEN can be helpful for
high-quality data generation and for reducing the human effort needed to
create complex benchmarks.
comment: COLM 2024, 35 pages
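The seedless generation-plus-self-correction flow can be sketched in a few
lines; draft and check are assumed LLM callables, and only the flow, not the
prompt templates, follows the abstract.

```python
# Schematic seedless generation plus self-correction; draft and check are
# assumed LLM callables and the flow (not the prompts) follows the abstract.
def targen(task_description, draft, check, n=100):
    dataset = []
    for _ in range(n):
        text, label = draft(task_description)          # no seed instances needed
        label = check(task_description, text, label)   # rectify wrong labels
        dataset.append((text, label))
    return dataset
```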
♻ ☆ 3M-Health: Multimodal Multi-Teacher Knowledge Distillation for Mental Health Detection CIKM 2024
The significance of mental health classification is paramount in contemporary
society, where digital platforms serve as crucial sources for monitoring
individuals' well-being. However, existing social media mental health datasets
primarily consist of text-only samples, potentially limiting the efficacy of
models trained on such data. Recognising that humans utilise cross-modal
information to comprehend complex situations or issues, we present a novel
approach to address the limitations of current methodologies. In this work, we
introduce a Multimodal and Multi-Teacher Knowledge Distillation model for
Mental Health Classification, leveraging insights from cross-modal human
understanding. Unlike conventional approaches that often rely on simple
concatenation to integrate diverse features, our model addresses the challenge
of appropriately representing inputs of varying natures (e.g., texts and
sounds). To mitigate the computational complexity associated with integrating
all features into a single model, we employ a multimodal and multi-teacher
architecture. By distributing the learning process across multiple teachers,
each specialising in a particular feature extraction aspect, we enhance the
overall mental health classification performance. Through experimental
validation, we demonstrate the efficacy of our model in achieving improved
performance.
comment: Accepted at CIKM 2024; Code will be made available at
https://github.com/adlnlp/3mhealth
♻ ☆ PersLLM: A Personified Training Approach for Large Language Models
Large language models exhibit aspects of human-level intelligence that
catalyze their application as human-like agents in domains such as social
simulations, human-machine interactions, and collaborative multi-agent systems.
However, the absence of distinct personalities, manifested in ingratiating
behaviors, inconsistent opinions, and uniform response patterns, diminishes
LLMs' utility in practical applications. Addressing this, the development of
personality traits in LLMs emerges as a crucial area of research to unlock
their latent potential. Existing methods to personify LLMs generally involve
strategies like employing stylized training data for instruction tuning or
using prompt engineering to simulate different personalities. These methods
only capture superficial linguistic styles instead of the core of personalities
and are therefore not stable. In this study, we propose PersLLM, integrating
psychology-grounded principles of personality: social practice, consistency,
and dynamic development, into a comprehensive training methodology. We
incorporate personality traits directly into the model parameters, enhancing
the model's resistance to induction, promoting consistency, and supporting the
dynamic evolution of personality. Single-agent evaluation validates our
method's superiority, as it produces responses more aligned with reference
personalities compared to other approaches. Case studies for multi-agent
communication highlight its benefits in enhancing opinion consistency within
individual agents and fostering collaborative creativity among multiple agents
in dialogue contexts, potentially benefiting human simulation and multi-agent
cooperation. Additionally, human-agent interaction evaluations indicate that
our personified models significantly enhance interactive experiences,
underscoring the practical implications of our research.
comment: 10 pages for main text, 5 figures
♻ ☆ LLM Discussion: Enhancing the Creativity of Large Language Models via Discussion Framework and Role-Play
Large language models (LLMs) have shown exceptional proficiency in natural
language processing but often fall short of generating creative and original
responses to open-ended questions. To enhance LLM creativity, our key insight
is to emulate the human process of inducing collective creativity through
engaging discussions with participants from diverse backgrounds and
perspectives. To this end, we propose LLM Discussion, a three-phase discussion
framework that facilitates vigorous and diverging idea exchanges and ensures
convergence to creative answers. Moreover, we adopt a role-playing technique by
assigning distinct roles to LLMs to combat the homogeneity of LLMs. We evaluate
the efficacy of the proposed framework with the Alternative Uses Test,
Similarities Test, Instances Test, and Scientific Creativity Test through both
LLM evaluation and human study. The results show that our proposed framework
outperforms single-LLM approaches and existing multi-LLM frameworks across
various creativity metrics. The code is available at
https://github.com/lawraa/LLM-Discussion.
comment: 40 pages, 9 figures, COLM 2024
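A rough sketch of what such a role-played, three-phase discussion loop can look like follows; `complete` is a placeholder for any chat-completion call, and the roles and phase prompts are invented for illustration rather than taken from the paper.

```python
# Hypothetical sketch of a role-played, multi-phase discussion loop.

def complete(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM API call here")

ROLES = ["inventor", "child", "engineer", "poet"]

def llm_discussion(question: str, rounds: int = 2) -> str:
    ideas = {}
    # Phase 1: each role-played agent answers independently.
    for role in ROLES:
        ideas[role] = complete(f"You are a {role}. {question}")
    # Phase 2: agents read each other's ideas and diverge further.
    for _ in range(rounds):
        transcript = "\n".join(f"{r}: {i}" for r, i in ideas.items())
        for role in ROLES:
            ideas[role] = complete(
                f"You are a {role}. Given this discussion:\n{transcript}\n"
                f"Build on or challenge the ideas. {question}")
    # Phase 3: converge on a single creative answer.
    transcript = "\n".join(f"{r}: {i}" for r, i in ideas.items())
    return complete(f"Summarize the most creative answer from:\n{transcript}")
```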
♻ ☆ EXAONE 3.0 7.8B Instruction Tuned Language Model
LG AI Research, :, Soyoung An, Kyunghoon Bae, Eunbi Choi, Stanley Jungkyu Choi, Yemuk Choi, Seokhee Hong, Yeonjung Hong, Junwon Hwang, Hyojin Jeon, Gerrard Jeongwon Jo, Hyunjik Jo, Jiyeon Jung, Yountae Jung, Euisoon Kim, Hyosang Kim, Joonkee Kim, Seonghwan Kim, Soyeon Kim, Sunkyoung Kim, Yireun Kim, Youchul Kim, Edward Hwayoung Lee, Haeju Lee, Honglak Lee, Jinsik Lee, Kyungmin Lee, Moontae Lee, Seungjun Lee, Woohyung Lim, Sangha Park, Sooyoun Park, Yongmin Park, Boseong Seo, Sihoon Yang, Heuiyeen Yeen, Kyungjae Yoo, Hyeongu Yun
We introduce the EXAONE 3.0 instruction-tuned language model, the first open
model in the family of Large Language Models (LLMs) developed by LG AI
Research. Among different model sizes, we publicly release the 7.8B
instruction-tuned model to promote open research and innovations. Through
extensive evaluations across a wide range of public and in-house benchmarks,
EXAONE 3.0 demonstrates highly competitive real-world performance with
instruction-following capability against other state-of-the-art open models of
similar size. Our comparative analysis shows that EXAONE 3.0 excels
particularly in Korean, while achieving compelling performance across general
tasks and complex reasoning. With its strong real-world effectiveness and
bilingual proficiency, we hope that EXAONE keeps contributing to advancements
in Expert AI. Our EXAONE 3.0 instruction-tuned model is available at
https://huggingface.co/LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct
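Assuming the checkpoint follows the standard Hugging Face transformers interface (the repository ships custom model code, so trust_remote_code is likely required), loading and querying the released model might look like this:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the released 7.8B instruct checkpoint from the URL above.
# trust_remote_code is an assumption about the repo's custom model code.
model_id = "LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", trust_remote_code=True)

inputs = tokenizer("Explain instruction tuning briefly.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```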
♻ ☆ Learn or Recall? Revisiting Incremental Learning with Pre-trained Language Models ACL 2024
Incremental Learning (IL) has been a long-standing problem in both vision and
Natural Language Processing (NLP) communities. In recent years, as Pre-trained
Language Models (PLMs) have achieved remarkable progress in various NLP
downstream tasks, utilizing PLMs as backbones has become a common practice in
recent IL research in NLP. Most studies assume that catastrophic forgetting is the
biggest obstacle to achieving superior IL performance and propose various
techniques to overcome this issue. However, we find that this assumption is
problematic. Specifically, we revisit more than 20 methods on four
classification tasks (Text Classification, Intent Classification, Relation
Extraction, and Named Entity Recognition) under the two most popular IL
settings (Class-Incremental and Task-Incremental) and reveal that most of them
severely underestimate the inherent anti-forgetting ability of PLMs. Based on
the observation, we propose a frustratingly easy method called SEQ* for IL with
PLMs. The results show that SEQ* has competitive or superior performance
compared to state-of-the-art (SOTA) IL methods while requiring considerably
fewer trainable parameters and less training time. These findings urge us to
revisit IL with PLMs and encourage future studies to seek a fundamental
understanding of catastrophic forgetting in PLMs. The data, code, and scripts
are publicly
available at
https://github.com/zzz47zzz/codebase-for-incremental-learning-with-llm.
comment: ACL 2024 main conference (Oral)
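For readers unfamiliar with the sequential baseline this line of work builds on, here is a minimal sketch of sequential IL with per-task evaluation; `train_one_task` and `evaluate` are hypothetical stubs, and SEQ*'s specific refinements are described in the paper, not reproduced here.

```python
# Plain sequential fine-tuning loop of the kind SEQ* builds on.

def train_one_task(model, task):
    raise NotImplementedError("fine-tune `model` on `task` here")

def evaluate(model, task) -> float:
    raise NotImplementedError("return accuracy of `model` on `task`")

def sequential_il(model, tasks):
    """Train on tasks one by one; after each, test on all tasks seen so far.

    Falling accuracy on earlier tasks quantifies forgetting, which the
    paper argues is much smaller for PLMs than commonly assumed.
    """
    history = []
    for i, task in enumerate(tasks):
        train_one_task(model, task)
        history.append([evaluate(model, seen) for seen in tasks[: i + 1]])
    return history
```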
♻ ☆ CALM : A Multi-task Benchmark for Comprehensive Assessment of Language Model Bias
As language models (LMs) become increasingly powerful and widely used, it is
important to quantify their sociodemographic biases and the resulting potential
for harm. Prior measures of bias are sensitive to perturbations in the
templates designed to compare performance across social groups, due to factors
such as low diversity or a limited number of templates. Moreover, most previous
work considers
only one NLP task. We introduce Comprehensive Assessment of Language Models
(CALM) for robust measurement of two types of universally relevant
sociodemographic bias, gender and race. CALM integrates sixteen datasets for
question-answering, sentiment analysis and natural language inference. Examples
from each dataset are filtered to produce 224 templates with high diversity
(e.g., length, vocabulary). We assemble 50 highly frequent person names for
each of seven distinct demographic groups to generate 78,400 prompts covering
the three NLP tasks. Our empirical evaluation shows that CALM bias scores are
more robust and far less sensitive than previous bias measurements to
perturbations in the templates, such as synonym substitution, or to random
subset selection of templates. We apply CALM to 20 large language models and
find that, for two model series, larger models tend to be more biased than
smaller ones. The T0 series is the least biased model family among the 20 LLMs
investigated here. The code is available at
https://github.com/vipulgupta1011/CALM.
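A toy sketch of CALM-style prompt construction, pairing templates with per-group name lists, might look as follows; the templates, group labels, and names here are invented placeholders, not CALM's actual data.

```python
from itertools import product

# Fill diverse templates with frequent person names per demographic group.
templates = [
    "{name} went to the bank. What did {name} ask about?",
    "The review by {name} was harsh. How did {name} feel?",
]
group_names = {
    "group_a": ["Alice", "Maria"],
    "group_b": ["Wei", "Amir"],
}

prompts = [
    (group, template.format(name=name))
    for (group, names), template in product(group_names.items(), templates)
    for name in names
]
# Bias is then estimated by comparing model performance across groups
# on otherwise identical prompts.
print(len(prompts))  # groups x names x templates = 8
```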
♻ ☆ LLMs Learn Task Heuristics from Demonstrations: A Heuristic-Driven Prompting Strategy for Document-Level Event Argument Extraction ACL 2024
In this study, we investigate in-context learning (ICL) in document-level
event argument extraction (EAE) to alleviate the dependency on large-scale
labeled data for this task. We introduce the Heuristic-Driven Link-of-Analogy
(HD-LoA) prompting to address the challenge of example selection and to develop
a prompting strategy tailored for EAE. Specifically, we hypothesize and
validate that LLMs learn task-specific heuristics from demonstrations via ICL.
Building upon this hypothesis, we introduce an explicit heuristic-driven
demonstration construction approach, which transforms haphazard example
selection into a systematic process that emphasizes task heuristics.
Additionally, inspired by human analogical reasoning, we propose the
link-of-analogy prompting, which enables LLMs to process new situations by
drawing analogies to known situations, enhancing their performance on unseen
classes beyond limited ICL examples. Experiments show that our method
outperforms existing prompting methods and few-shot supervised learning methods
on document-level EAE datasets. Additionally, the HD-LoA prompting shows
effectiveness in diverse tasks like sentiment analysis and natural language
inference, demonstrating its broad adaptability.
comment: Accepted to ACL 2024
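Below is a hypothetical sketch of how an HD-LoA-style prompt could be assembled, with demonstrations organized around explicit heuristics and a query that asks the model to reason by analogy; the wording and structure are illustrative, not the paper's exact templates.

```python
def build_hd_loa_prompt(heuristics, demonstrations, document, question):
    """Assemble a heuristic-driven, analogy-drawing prompt (illustrative)."""
    parts = ["Task heuristics:"]
    parts += [f"- {h}" for h in heuristics]
    parts.append("Worked examples, each labeled with the heuristic it uses:")
    for heuristic, example, answer in demonstrations:
        parts.append(f"[{heuristic}] {example} -> {answer}")
    parts.append(
        "Now, by analogy to the example whose heuristic fits best, "
        f"extract the argument.\nDocument: {document}\nQuestion: {question}")
    return "\n".join(parts)
```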
♻ ☆ EMO-KNOW: A Large Scale Dataset on Emotion and Emotion-cause EMNLP 2023
Emotion-Cause analysis has attracted the attention of researchers in recent
years. However, most existing datasets are limited in size and number of
emotion categories. They often focus on extracting parts of the document that
contain the emotion cause and fail to provide a more abstractive, generalizable
root cause. To bridge this gap, we introduce a large-scale dataset of emotion
causes, derived from 9.8 million cleaned tweets over 15 years. We describe our
curation process, which includes a comprehensive pipeline for data gathering,
cleaning, labeling, and validation, ensuring the dataset's reliability and
richness. We extract emotion labels and provide abstractive summarization of
the events causing emotions. The final dataset comprises over 700,000 tweets
with corresponding emotion-cause pairs spanning 48 emotion classes, validated
by human evaluators. The novelty of our dataset stems from its broad spectrum
of emotion classes and the abstractive emotion cause that facilitates the
development of an emotion-cause knowledge graph for nuanced reasoning. Our
dataset will enable the design of emotion-aware systems that account for the
diverse emotional responses of different people for the same event.
comment: Accepted to Findings of EMNLP 2023
♻ ☆ Tell Me What's Next: Textual Foresight for Generic UI Representations ACL 2024
Mobile app user interfaces (UIs) are rich with action, text, structure, and
image content that can be utilized to learn generic UI representations for
tasks like automating user commands, summarizing content, and evaluating the
accessibility of user interfaces. Prior work has learned strong visual
representations with local or global captioning losses, but fails to retain
both granularities. To combat this, we propose Textual Foresight, a novel
pretraining objective for learning UI screen representations. Textual Foresight
generates global text descriptions of future UI states given a current UI and
local action taken. Our approach requires joint reasoning over elements and
entire screens, resulting in improved UI features: on generation tasks, UI
agents trained with Textual Foresight outperform state-of-the-art by 2% with
28x fewer images. We train with our newly constructed mobile app dataset,
OpenApp, which results in the first public dataset for app UI representation
learning. OpenApp enables new baselines, and we find Textual Foresight improves
average task performance over them by 5.7% while using 2x less data.
comment: Accepted to ACL 2024 Findings. Data and code to be released at
https://github.com/aburns4/textualforesight
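Schematically, the Textual Foresight objective can be read as a conditional captioning loss over the next screen; in the sketch below the encoders and decoder are hypothetical modules and the tensor shapes are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def textual_foresight_loss(screen_encoder, action_encoder, text_decoder,
                           screen, action, future_caption_ids):
    """Given the current screen and a local action, train the decoder to
    generate a global caption of the *next* UI state (teacher forcing)."""
    screen_feats = screen_encoder(screen)    # (B, N, D) visual tokens
    action_feats = action_encoder(action)    # (B, 1, D) action token
    context = torch.cat([screen_feats, action_feats], dim=1)
    logits = text_decoder(context, future_caption_ids[:, :-1])  # (B, T-1, V)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        future_caption_ids[:, 1:].reshape(-1))
```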
♻ ☆ Cross-domain Named Entity Recognition via Graph Matching ACL
Cross-domain NER is a practical yet challenging problem due to data scarcity
in real-world scenarios. A common practice is first to learn an NER model in a
resource-rich general domain and then adapt it to specific domains. Because
entity types are mismatched across domains, the broad knowledge from the
general domain cannot transfer effectively to the target-domain NER model. To
this end, we model the label relationship as a probability
distribution and construct label graphs in both source and target label spaces.
To enhance the contextual representation with label structures, we fuse the
label graph into the word embedding output by BERT. By representing label
relationships as graphs, we formulate cross-domain NER as a graph matching
problem. Furthermore, the proposed method has good applicability with
pre-training methods and is potentially capable of other cross-domain
prediction tasks. Empirical results on four datasets show that our method
outperforms a series of transfer learning, multi-task learning, and few-shot
learning methods.
comment: Findings of ACL 2022; available at
https://aclanthology.org/2022.findings-acl.210/; improved presentation
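A rough sketch of fusing propagated label-graph states into token representations, in the spirit of the description above; the single-step graph propagation, attention-based fusion, and dimensions are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class LabelGraphFusion(nn.Module):
    def __init__(self, hidden=768, num_labels=9):
        super().__init__()
        self.label_emb = nn.Parameter(torch.randn(num_labels, hidden))
        self.proj = nn.Linear(2 * hidden, hidden)

    def forward(self, token_reprs, label_adj):
        # One propagation step over the label graph; label_adj is the
        # label-relationship probability matrix, shape (L, L).
        label_states = label_adj @ self.label_emb            # (L, H)
        # Attend each BERT token to the propagated label states.
        attn = torch.softmax(token_reprs @ label_states.T, dim=-1)  # (B, T, L)
        label_context = attn @ label_states                  # (B, T, H)
        return self.proj(torch.cat([token_reprs, label_context], dim=-1))
```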
♻ ☆ MAP's not dead yet: Uncovering true language model modes by conditioning away degeneracy ACL
It has been widely observed that exact or approximate MAP (mode-seeking)
decoding from natural language generation (NLG) models consistently leads to
degenerate outputs (Holtzman et al., 2019; Stahlberg and Byrne, 2019). Prior
work has attributed this behavior to either a fundamental and unavoidable
inadequacy of modes in probabilistic models or weaknesses in language modeling.
In contrast, we argue that degenerate modes can occur even in the absence of
any modeling error, due to contamination of the training data. Specifically, we
argue that mixing even a tiny amount of low-entropy noise with a population
text distribution can cause the data distribution's mode to become degenerate.
We therefore propose to apply MAP decoding to the model's true conditional
distribution where the conditioning variable explicitly avoids specific
degenerate behavior. Using exact search, we empirically verify that the
length-conditional modes of machine translation models and language models are
indeed more fluent and topical than their unconditional modes. For the first
time, we also share many examples of exact modal sequences from these models,
and from several variants of the LLaMA-7B model. Notably, we observe that
various kinds of degenerate modes persist, even at the scale of LLaMA-7B.
Although we cannot tractably address these degeneracies with exact search, we
perform a classifier-based approximate search on LLaMA-7B, a model which was
not trained for instruction following, and find that we are able to elicit
reasonable outputs without any finetuning.
comment: 52 pages, 5 figures, ACL version
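The core argument admits a short arithmetic illustration: spread most of the probability mass over many fluent outputs, give a sliver to one low-entropy degenerate string, and the degenerate string becomes the mode. The numbers below are invented for illustration.

```python
# Toy illustration: 99% of the mass spread over 10,000 fluent outputs,
# 1% concentrated on a single degenerate string (e.g., the empty string).
fluent_mass, num_fluent = 0.99, 10_000
noise_mass = 0.01

p_any_fluent = fluent_mass / num_fluent  # 0.000099 per fluent output
print(p_any_fluent < noise_mass)         # True: the degenerate string is the mode
```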
♻ ☆ Fairness in Large Language Models in Three Hours
Large Language Models (LLMs) have demonstrated remarkable success across
various domains but often lack fairness considerations, potentially leading to
discriminatory outcomes against marginalized populations. Unlike fairness in
traditional machine learning, fairness in LLMs involves unique backgrounds,
taxonomies, and fulfillment techniques. This tutorial provides a systematic
overview of recent advances in the literature concerning fair LLMs, beginning
with real-world case studies to introduce LLMs, followed by an analysis of bias
causes therein. The concept of fairness in LLMs is then explored, summarizing
the strategies for evaluating bias and the algorithms designed to promote
fairness. Additionally, resources for assessing bias in LLMs, including
toolkits and datasets, are compiled, and current research challenges and open
questions in the field are discussed. The repository is available at
\url{https://github.com/LavinWong/Fairness-in-Large-Language-Models}.
♻ ☆ NACL: A General and Effective KV Cache Eviction Framework for LLMs at Inference Time ACL 2024
Yilong Chen, Guoxia Wang, Junyuan Shang, Shiyao Cui, Zhenyu Zhang, Tingwen Liu, Shuohuan Wang, Yu Sun, Dianhai Yu, Hua Wu
Large Language Models (LLMs) have ignited an innovative surge of AI
applications, marking a new era of exciting possibilities equipped with
extended context windows. However, hosting these models is cost-prohibitive
mainly due to the extensive memory consumption of KV Cache involving
long-context modeling. Despite several works proposing to evict unnecessary
tokens from the KV Cache, most of them rely on the biased local statistics of
accumulated attention scores and report performance using unconvincing metrics
like perplexity on inadequate short-text evaluations. In this paper, we propose
NACL, a general framework for long-context KV cache eviction that achieves more
accurate and efficient eviction in a single operation during the encoding phase.
Due to NACL's efficiency, we combine more accurate attention score statistics
in PROXY TOKENS EVICTION with the diversified random eviction strategy of
RANDOM EVICTION, aiming to alleviate the issue of attention bias and enhance
the robustness in maintaining pivotal tokens for long-context modeling tasks.
Notably, our method significantly improves the performance on short- and
long-text tasks by 80% and 76% respectively, reducing the KV Cache by up to
50% while maintaining over 95% of performance. The code is available at
https://github.com/PaddlePaddle/Research/tree/master/NLP/ACL2024-NACL.
comment: Accepted by ACL 2024 (main conference, long paper)
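Below is a schematic of one-shot eviction combining proxy-token attention statistics with random sampling, per the description above; the tensor shapes, choice of proxy positions, and the top-k/random split are illustrative assumptions, not NACL's exact algorithm.

```python
import torch

def evict_kv(attn, proxy_idx, keep: int, random_frac: float = 0.2):
    """Pick KV-cache indices to keep for one layer.

    attn: (num_heads, q_len, kv_len) attention weights.
    proxy_idx: query positions used as proxies for scoring cached tokens.
    """
    # Score each cached token by attention mass from the proxy positions.
    scores = attn[:, proxy_idx, :].mean(dim=(0, 1))      # (kv_len,)
    n_rand = int(keep * random_frac)
    top = torch.topk(scores, keep - n_rand).indices
    # Sample the remainder uniformly to diversify retained tokens.
    remaining = torch.ones_like(scores, dtype=torch.bool)
    remaining[top] = False
    pool = remaining.nonzero(as_tuple=True)[0]
    rand = pool[torch.randperm(pool.numel())[:n_rand]]
    return torch.sort(torch.cat([top, rand])).values     # indices to keep
```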
♻ ☆ PAGED: A Benchmark for Procedural Graphs Extraction from Documents ACL 2024
Automatic extraction of procedural graphs from documents creates a low-cost
way for users to easily understand a complex procedure by skimming visual
graphs. Despite the progress in recent studies, two questions remain open:
whether existing studies have solved this task well (Q1) and whether emerging
large language models (LLMs) can bring new opportunities to it (Q2). To
this end, we propose a new benchmark PAGED, equipped with a large high-quality
dataset and standard evaluations. It investigates five state-of-the-art
baselines, revealing that they fail to extract procedural graphs well
because of their heavy reliance on hand-written rules and limited available
data. We further involve three advanced LLMs in PAGED and enhance them with a
novel self-refine strategy. The results point out the advantages of LLMs in
identifying textual elements and their gaps in building logical structures. We
hope PAGED can serve as a major landmark for automatic procedural graph
extraction and the investigations in PAGED can offer insights into the research
on logic reasoning among non-sequential elements.
comment: Accepted to The 62nd Annual Meeting of the Association for
Computational Linguistics (ACL 2024)
♻ ☆ CARE: A Clue-guided Assistant for CSRs to Read User Manuals ACL 2024
Building a reading assistant saves time for customer service representatives
(CSRs) when reading user manuals, especially information-rich ones. Current
solutions do not fit online customer service scenarios well because they pay
little attention to user questions and possible responses. Hence, we
propose to develop a time-saving and careful reading assistant for CSRs, named
CARE. It can help the CSRs quickly find proper responses from the user manuals
via explicit clue chains. Specifically, each of the clue chains is formed by
inferring over the user manuals, starting from the question clue aligned with
the user question and ending at a possible response. To overcome the shortage
of supervised data, we adopt the self-supervised strategy for model learning.
The offline experiment shows that CARE is efficient in automatically inferring
accurate responses from the user manual. The online experiment further
demonstrates the superiority of CARE to reduce CSRs' reading burden and keep
high service quality: in particular, time spent decreases by more than 35%
while the ICC score remains above 0.75.
comment: Accepted to The 62nd Annual Meeting of the Association for
Computational Linguistics (ACL 2024)
♻ ☆ What Matters in Transformers? Not All Attention is Needed
Scaling Transformer-based large language models (LLMs) has demonstrated
promising performance across various tasks. However, it also introduces
redundant structures, posing challenges for real-world deployment. Despite some
recognition of redundancy in LLMs, the variability of redundancy across
different modules, such as MLP and Attention layers, is under-explored. In this
work, we investigate the varying redundancy across different modules within
Transformers, including Blocks, MLP, and Attention layers, using a
similarity-based metric. This metric operates on the premise that redundant
structures produce outputs highly similar to their inputs. Surprisingly, while
attention layers are essential for transformers and distinguish them from other
mainstream architectures, we find that a large proportion of attention layers
exhibit excessively high similarity and can be safely pruned without degrading
performance, leading to reduced memory and computation costs. Additionally, we
further propose a method that jointly drops Attention and MLP layers, achieving
improved performance at higher drop ratios. Extensive experiments demonstrate the
effectiveness of our methods, e.g., Llama-3-70B maintains comparable
performance even after pruning half of the attention layers. Our findings
provide valuable insights for future network architecture design. The code is
released at: \url{https://github.com/Shwai-He/LLM-Drop}.
comment: 15 pages, 13 figures, 6 tables
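A minimal sketch of the similarity-based redundancy metric described above: compare a module's input and output activations and flag near-identity modules as pruning candidates. The threshold and any calibration procedure are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def module_redundancy(hidden_in, hidden_out):
    """Mean cosine similarity between a module's input and output.

    hidden_*: (batch, seq, dim) activations before/after the module.
    A value near 1.0 means the module barely changes its input.
    """
    sim = F.cosine_similarity(hidden_in.flatten(1), hidden_out.flatten(1), dim=-1)
    return sim.mean().item()

def prunable_layers(similarities, threshold=0.95):
    # Layers whose output is nearly identical to their input are candidates
    # for safe removal, per the paper's premise.
    return [i for i, s in enumerate(similarities) if s > threshold]
```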